this is aaronland

carbonite and brochureware

The following was originally published on the Museums and the Web 2012 conference website. It was co-written with Ryan Donahue, then of the George Eastman House. At the time this was written I was about to start a new role at the Cooper Hewitt and actively working on the Parallel Flickr project. I can't remember the details (thirteen years later as I write this) but by the time we finally finished this paper we actually disagreed quite strongly about what the best, or at least most realistic, approach would be. The slides for the talk that accompanied this paper follow the text of the paper itself.

Introduction

The digital turn, whether it be in photography, motion pictures, literature or other pursuits, has forever changed the ways and means by which museums collect, interpret and disseminate objects. Historically, museums have dealt primarily with objects that, in some way, are intuitively graspable from observation. Arrowheads are pointy, photographs are plainly seen, and writing read (with varying degrees of difficulty).

Digital information, on the other hand, is far more difficult to grasp intuitively, and in many cases impossible. This has led to a shift in approach to collection management: collect materials before they are scattered to the hard drives and floppy discs of history. With such a shift in approach, museums must become adept at preserving digital objects and ephemera in their original context: in many cases, a website. We examine one site of particular interest and complexity: Flickr.

Preserving Flickr has long been a popular subject of conversation among Flickr staff, alumni, and the museum community. Various tools and first steps have been taken, but no one to date has addressed the seemingly impossible task of preservation.

The authors of this paper are very familiar with its subject. Aaron Straup Cope is a Flickr alumnus, and Ryan Donahue is the Manager of Collection Information, Digital Assets and Web Development at George Eastman House, an early partner institution of the Flickr Commons. Through our knowledge and consultation with museum professionals, we shall address:

Journey Into Self

“The Internet is Disappearing...” – The Economist, Feb 5th 1994

“Though primarily concerned with Leyda's ‘squirming, seemingly formless larvae,’ Rick Prelinger has long been fascinated (and occasionally troubled) by the process by which this raw material is reworked into documentary cinema.” -– Leo Goldsmith, Recycling Programs

“History has always been lossy.” – Aaron Straup Cope, over drinks probably

Archiving web-based documents, whether in isolation or within the context of a larger parent website, has been a part of digital archival practice in many institutions for the better part of the last decade. Early archives were built manually, by saving HTML documents and related media (such as video or images), or with an archival web crawler such as Heritrix (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix). For sites of greater complexity, such as blogs and wikis, application-specific archival tools remain nascent. Those that do exist tend to export their data as XML or similar data serialization formats, and can be as elementary as direct database exports or as complex as hierarchical documents with multiple, sometimes overlapping, schemas.

Among the six hundred million plus websites on the internet (Netcraft, 2012), there is a distinct subset of large, dynamic and community-focused social websites (like Flickr, Facebook or Twitter, for example) which will mandate a dramatic shift in archival practice and thought, particularly when it comes to digital preservation. Given their scale and complexity, previously established methodologies in digital preservation are ill-suited for the task of creating meaningful archives of these types of websites; preserving a perfect mirror image, or even any sort of representative sample of these websites remains impractical given their size and dynamic nature.

The challenge, then, becomes how best to preserve these sites in order to provide for scholarly inquiry, create a historical record and enable curatorial interpretation without archiving the web application in whole.

Our paper addresses this problem using, but not limiting itself to, the photo-sharing website Flickr as a case-study (a sort of patient-zero). We describe the challenges faced and outline a practical approach for the preservation of the top-tier of large, dynamic, and social websites, and a set of best practices going forward. Not all archiving projects will face the kinds of challenges discussed here but it is hoped that by tackling an extreme case like Flickr we can hint at a way forward for other archiving projects, both large and small, in the future.

A Few Uncomfortable (and Inconvenient) Truths

For our intents and purposes, the things that museums are not presently able to do for large, dynamic websites include:

Methodologies of Archiving

The “Brochureware” method

The term “brochureware” is not used here as a pejorative, but rather reflects the static nature of the archive, bounded in both size and scope, and the number of objects in a collection. This methodology involves archiving a site using minimal documentary means: screenshots, descriptive narratives and a limited amount of static content from the site itself. It is the earliest, and most popular, methodology employed in contemporary practice, and is well-suited to small websites with little or no interactivity or dynamic content.

The archive ends up being a family album of sorts, containing snapshots of a website’s life. A unique benefit of this approach is that it is universally applicable across the web. This methodology is employed by the Internet Archive and many of the standard web archiving tools presently available. It is, however, generally impractical for archiving Flickr in its entirety: Flickr was designed as a series of “small tools for self-organizing” communities and as an exploration of photography as a social object and a medium. Neither of these is a concept that lends itself readily to being “snapshot-ed.”

The “Formaldehyde” method

Using this approach, an archive would systematically pre-render, or "bake", all the dynamic pages and possible interactions of a website and host the result as a read-only website. This assumes that viewers continue to use web browsers and that the archive in question operates a web server to serve those pages, but the HTML and HTTP standards are both well-documented and relatively easy to implement even assuming a time when the “Web” is no longer ubiquitous, or nearly so.
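A minimal sketch of what "baking" might look like, assuming a small Python harness (the function names and path-mapping scheme here are illustrative, not any particular tool's): each dynamic URL is fetched once and written to a static file that a plain web server can later re-serve.

```python
import pathlib
from urllib.parse import urlparse
from urllib.request import urlopen

def baked_path(url: str, out_root: str = "baked") -> pathlib.Path:
    """Map a dynamic URL to the static file that will stand in for it."""
    parsed = urlparse(url)
    # "/photos/12345/" becomes "baked/example.com/photos/12345.html"
    path = parsed.path.rstrip("/") or "/index"
    target = pathlib.Path(out_root) / parsed.netloc / path.lstrip("/")
    if not target.suffix:
        target = target.with_suffix(".html")
    return target

def bake_page(url: str, out_root: str = "baked") -> pathlib.Path:
    """Fetch one live page and write its HTML to the baked location."""
    target = baked_path(url, out_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:  # one fetch of the still-live page
        target.write_bytes(response.read())
    return target
```

Run over every reachable URL, the result is a directory tree any web server can host read-only; the combinatorial explosion of user-specific views is exactly what this sketch cannot capture, which is the point made below.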

Like the “brochureware” method, this is impractical for archiving a website like Flickr. The biggest problem with using this methodology with Flickr is the “singular perspective” problem. Sites like Flickr provide customized content for users based on age, country of origin and other user characteristics, along with restrictions on that content dictated by the relationships between a photographer and other Flickr users. Trying to preserve a website like Flickr using the “formaldehyde” method would manifest itself as an incomplete archive, particularly when archived from a controlled locale (for example, Germany and China each have government mandated restrictions on who can see specific content and when), or in sites where the social graph is used to filter content. Although theoretically possible with a site like Flickr, the number of combined pages (particularly search results) suggests that this may be an unmanageable and impractical approach. [2]

The “Carbonite” method

“In some locales, carbonite was also used to preserve the bodies of the dead… Han Solo was frozen in carbonite on Darth Vader's orders to test whether a Human could survive being encased in carbonite before his plan to freeze Luke Skywalker. The modifications to the freezing chamber were successful in that Han lived through the freezing process and entered into suspended animation.” – Carbonite - Wookieepedia, the Star Wars Wiki

Carbonite is a fictional metal alloy from the popular Star Wars movies which, when mixed with tibanna gas (also fictional) and compressed, can be frozen into blocks for transport. This methodology borrows its name from the practice of “freeze-drying” the entire stack used to create a website (databases and other servers, along with the configurations, operating systems, user accounts and all other files) into a software-based virtual machine which can later be reactivated for research or exhibition purposes. This can be thought of as akin to a research library with gated access to original materials. The piece itself could no longer be updated, but the work as it once existed can still be preserved.

The principal difficulty with this methodology is the efficacy of making an archive accessible in a read-only state. While Flickr itself was built around the idea of “feature flags”, which make it easy to disable specific pieces of functionality (disabling uploads without having to disable the entire site, for example), not all sites are designed in this way. This leads to the question of whether an archive should take it upon itself to modify a copy for exhibition and access. Even then, institutions would have to revisit the age-old question of whether modification would materially change the nature of the website being archived.
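A feature flag can be as simple as a global switch consulted at runtime. The sketch below is hypothetical (Flickr's actual implementation is not public), but it illustrates why a flag-based design makes freezing an archived copy read-only a matter of configuration rather than code surgery:

```python
# Hypothetical feature-flag sketch: functionality is gated by named
# switches, so one configuration change disables uploads without
# taking the rest of the site offline.
FLAGS = {
    "uploads_enabled": True,
    "comments_enabled": True,
}

def flag_enabled(name: str) -> bool:
    # unknown flags default to off, which fails safe
    return FLAGS.get(name, False)

def handle_upload(username: str) -> str:
    if not flag_enabled("uploads_enabled"):
        return "uploads are temporarily disabled"
    return f"photo stored for {username}"

# "freezing" an archived copy becomes a single configuration change:
FLAGS["uploads_enabled"] = False
```

A site built without such gates offers no equivalent lever, which is why an archive may be forced to modify the copy itself.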

The “Carbonite” method is often the approach adopted by design studios who struggle with how to preserve their work when a client stops supporting it or the software used to view it is no longer supported.

A concrete example of this problem and approach is the Flickr “Explore Clock” designed and built for Flickr by Stamen Design in 2009. The clock was developed as part of the introduction of video as a supported medium on Flickr. Users were asked to add a special kind of “machine” tag to their videos indicating what time the video was taken and then submit their work to a moderated group for approval. (Unlike digital photos there is no single metadata standard that video cameras follow and so relatively simple things like the date and time a video was shot are absent from the files themselves.) The videos were then displayed on an interactive Flash-based timeline, developed by Stamen, that retrieved records using the Flickr API.

In January 2012 Flickr announced that it would remove the Explore Clock from the site because “its user experience is complicated and it’s not core to our product offering” (http://blog.flickr.net/en/2012/01/13/start-the-new-year-fresh/). How then should Stamen, or a cultural heritage institution, preserve this work? The most effective means would be to archive the results of the Flickr API calls themselves (typically XML files sent over HTTP), for the dates that the Clock was operational, as static files on another server and then update the Flash application to request and retrieve its data from this updated endpoint. Whether the application was modified to request the actual files or call another program that mimics the Flickr API and processes those files is outside the scope of this discussion. The salient point is that the work in question has few enough moving pieces (it is a self-contained application that interacts with Flickr using a set of bounded criteria that can itself be stored as static files) that the “carbonite” method is applicable.
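The archive-the-API-responses strategy can be sketched as follows, with hypothetical helper names: each method-plus-parameters combination maps to a deterministic file name, so the frozen application (or a small shim mimicking the live API) can look up the stored response instead of calling the real service.

```python
import json
import pathlib

def response_path(root: str, method: str, params: dict) -> pathlib.Path:
    """Derive a deterministic file name from an API method and its
    parameters, so the same call always maps to the same flat file."""
    key = "_".join([method] + [f"{k}-{params[k]}" for k in sorted(params)])
    return pathlib.Path(root) / f"{key}.json"

def save_response(root: str, method: str, params: dict, payload: dict) -> None:
    """Archive one API response as a static JSON file."""
    target = response_path(root, method, params)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))

def load_response(root: str, method: str, params: dict) -> dict:
    """What the re-pointed client does instead of calling the live API."""
    return json.loads(response_path(root, method, params).read_text())
```

Because the Clock only ever issued a bounded set of calls, enumerating and saving them all is tractable in a way it would not be for the site as a whole.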

The “Dressmaker” method

Using this approach the raw component parts of a site (databases, user accounts, etc.) are stored but no attempt is made to reactivate the site itself, and no attempt to create a single canonical index is made. It is assumed that the archive itself is treated in the same way that a dressmaker might use (and reuse and combine) a single pattern with different fabrics.

It shares characteristics with the “formaldehyde” method in that it presumes that as much raw data as possible will be preserved, preferably as plain-text documents, but it does not assume that, on their own, these documents will constitute any real or meaningful representation of the thing being archived.

The archive itself is not an artifact, nor can it be for all the reasons discussed above. Rather, it is an attempt to preserve things in a way that forces the archive itself to focus on the process in which a website was created and thrived rather than on any single representation or artifact.

By virtue of the way (the "how") that these sites would be archived, putting together an interpretation (curating) is not simply a question of arranging things (artifacts) but of creating many possible new things, or derivatives, that sample the whole. The interpretation of the archive manifests itself as something like, but not the same as, the thing being preserved.

Of the four methodologies described, this is the best suited to a website as large and far-reaching as Flickr.

Anti-Museums

“Ephemeral films were generally produced to fulfill specific objectives at specific times, and most often were not considered to be of value afterwards. In retrospect, they provide unparalleled evidence of the visual appearance and ambiance of their time, and function as rich, evocative, and often entertaining documentation of the American past. … Frequently offering more than just evidence, ephemeral films document past persuasions and anxieties.  They show us not only how we were, but how we were supposed to be." – Prelinger Archives, Collection Summary

Flickr, by its very nature, has always been a kind of anti-museum. This was always its strength, and the breadth of photos always outweighed the inevitable tidal wave of potentially mediocre submissions that comes from letting "anyone" participate. By extension, it becomes essentially impossible to assign any sort of curatorial guidance, or policy, when it comes to archiving. Insofar as a policy exists, the short version would be: Yes!

The long version is: Yes, so long as it, and all the associated metadata, has been or is subsequently blessed as "public" by the photo owner. This is more of a practical matter than a philosophical stance, since managing permissions rapidly becomes a significant technical and resource challenge, not to mention an ethical one, that falls outside the remit of cultural heritage institutions.

Archiving a website like Flickr is a massive challenge for any one institution to undertake. A more practical approach might be for an institution to:

There is an important point to make about community contributions: Not all contributions can, or should, be assigned the same weight or value; however, institutions should note that trying to create an archive of a community-based website without also allowing the same community to participate runs the risk of antagonizing the users, fostering suspicion and potentially encouraging competing archival copies. Institutions need to be aware of this new form of cultural sensitivity when dealing with community-driven websites, and it’s in an institution’s best interest to not alienate said community.

Museums and libraries and archives are no longer the only institutions capable of collecting, housing and organizing cultural heritage. What the Internet has demonstrated is that it is possible for communities of interest to self-organize around a topic and in a relatively short span of time produce bodies of work that sometimes rival traditional scholars in their depth and usually exceed them in their breadth.

That is not to say, however, that museums, libraries and archives should abstain from collecting materials. As history has proven repeatedly, any archival system dependent upon external entities making sound archiving decisions is fundamentally flawed. What this finding does encourage is cooperation between institutions and the users (in Flickr’s case, the small self-organizing communities) that constitute the lifeblood of socially-driven communities, particularly in the area of cataloging. Using primary sources isn’t necessarily news, but in the context of socially-driven websites, it can be overlooked.

Cataloging is the place where museums and the larger public and communities of enthusiasts are meeting and being forced to find common ground. The opportunity facing museums, and by extension museum studies, today is how to use and shape participation in that cataloging process: to imagine museums not simply as archives of a considered past but also as zones of safekeeping for future considerations.

parallel-flickr

The application “parallel-flickr” is a useful proof-of-concept implementation of some of these ideas. It is designed to archive one or more users’ Flickr photos and to generate a database-backed website that honors the viewing permissions those users have chosen for their photos on Flickr. Additionally, photos that a user has “favorited” on Flickr are also archived. This allows each individual instance of parallel-flickr to be slightly rough around the edges, because it will contain truncated copies of other users’ photo streams: a reflection of the way users have experienced Flickr and other Flickr users.

The project was started in 2010 and provides a tangible example of some of the issues and suggestions addressed in this paper: something to “click” on to see what does and does not work and to provide a concrete framework around which discussion can continue. It uses the “dressmaker” approach, storing photos and metadata files separately from the application itself and using a series of tools to harvest and index that data into a standalone web application. It remains in active development, which means that not all the issues inherent in archiving a site like Flickr, even for a small and tightly controlled network of users, have been addressed or solved.

parallel-flickr is an open source project built on the standard “LAMP” stack (Linux, Apache, MySQL and PHP), the Solr document indexer, and a series of command-line tools for fetching and indexing the flat files described above. Minus the optional Solr requirement, parallel-flickr has been shown to run on ‘plain vanilla’ shared-hosting providers. If Solr is configured, it provides the following functionality:

parallel-flickr archives the following data:

Except for the original photos, these are not necessarily the items displayed to visitors; rather, they are the raw "ingredients" that exist independent of any institution’s specific cataloging system or curator's interpretation.

A single photo with one comment would then demand that the following files (or objects) be archived:

# the original photo
/photos/234/676/884/343/234676884343_s4gfdJX54.jpg

# basic metadata about the photo
/photos/234/676/884/343/234676884343_s4gfdJX54_i.json

# comments on the photo
/photos/234/676/884/343/234676884343_s4gfdJX54_c.json

# basic metadata about the photo owner and the photo commenter
/users/663/538/253/663538253@N00.json
/users/335/274/3/3352743@N01.json

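The directory scheme in the listing above can be expressed as a small function. This is a sketch consistent with the example paths, not parallel-flickr's actual code: the numeric photo ID is split into three-digit groups so that no single directory accumulates millions of files.

```python
def photo_dir(photo_id: str) -> str:
    """Split the numeric photo ID into three-digit groups to build
    the nested directory tree shown in the listing above."""
    groups = [photo_id[i:i + 3] for i in range(0, len(photo_id), 3)]
    return "/photos/" + "/".join(groups)

def photo_path(photo_id: str, secret: str, suffix: str = ".jpg") -> str:
    """Full path for the photo itself, or (with a different suffix)
    for one of its flat-file metadata companions."""
    return f"{photo_dir(photo_id)}/{photo_id}_{secret}{suffix}"
```

The same ID always maps to the same path, so harvesting tools and indexers can locate a photo's files without consulting a database first.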
Flat files have a number of advantages as a primary archive:

Technically, there is nothing special or required about the JSON format. Its main advantages, as of this writing, are that it is a plain-text format which can be easily converted; that it assumes the UTF-8 character encoding by default; that it already enjoys wide and popular support in most programming languages, particularly JavaScript; and that it would allow an archive to expose IDs and URLs on the web and make them accessible to third-party clients using JavaScript and the Cross-Origin Resource Sharing (CORS) standard.
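As an illustration of that last point, a minimal server (Python's standard library; the class and helper names are ours) could expose the archived flat files with the CORS header that browsers require before permitting cross-origin reads:

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class CORSHandler(SimpleHTTPRequestHandler):
    """Static-file handler that adds the CORS header allowing any
    origin, so browser-based third-party clients can read the JSON."""
    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

def serve_archive(directory: str, port: int = 0) -> ThreadingHTTPServer:
    """Bind a server over the archive directory; port 0 picks a free
    port. Call serve_forever() (e.g. in a background thread) to run."""
    handler = partial(CORSHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Any JavaScript client on any domain could then fetch the archive's JSON directly, which is precisely the kind of third-party reuse the flat-file approach makes cheap.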

Likewise, assuming available disk space, there is nothing in the code base that would prevent an institution from modifying the application so that it will also generate metadata serialization in other formats, most notably XML, which still enjoys widespread adoption and use among cultural heritage organizations.

What has been described so far is not an indexing system or cataloguing methodology. These pieces are highly specific to the collecting institution, and are best suited to conform to institutional standards for cataloging. Decoupling the archival elements from the cataloging record enforces a separation of concerns that should permit more, and faster, iteration.

Cataloging born-digital elements does represent a significant shift in the availability of cataloging data (Johnston, 2010), particularly descriptive metadata. With Flickr, some pieces of data map nicely (titles, descriptions and keywords, in particular), while others do not map cleanly. Is a photo considered authored by the uploading user? Are they the attributed author? While Flickr’s terms of service may require account holders to upload only their own materials (except for the institutions in the Flickr Commons, that is), many users upload content that is not their own. Once the hard mapping decisions are made, however, cataloging can be a mostly automatic process.
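As an illustration, a first-pass automatic mapping might look like the sketch below. The catalog field names are invented for the example, not any institution's schema, and authorship is deliberately flagged for human review rather than asserted, for the reasons just described.

```python
def to_catalog_record(photo: dict) -> dict:
    """Map Flickr-style photo metadata onto a flat catalog record.
    Titles, descriptions and keywords map directly; authorship does
    not, since the uploader is not reliably the photographer."""
    return {
        "title": photo.get("title", ""),
        "description": photo.get("description", ""),
        "keywords": [tag["raw"] for tag in photo.get("tags", [])],
        "candidate_author": photo.get("owner", ""),  # needs human review
    }
```

The easy fields run unattended; the flagged field is where institutional judgment re-enters the process.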

Here's a layer cake of the whole stack, with the various pieces we're talking about identified:

The “dressmaker” approach presumes that an archive will be given form inside the larger context of a museum’s cataloging system. To assume otherwise places an unrealistic burden on institutions by asking them to replace the large and complex systems already in place to manage their collections. On the other hand, the dressmaker approach does presume that an archive consists of at least three distinct components:

Why is this approach better? It is more flexible, and both easier and less expensive to adapt than the alternatives. It is ultimately more reliable, despite the upfront cost of “processing” files. Every programming environment can process text files, while specialized databases are a software and hardware dependency nightmare waiting to happen.

Finally it lays bare the fact that by forcing the data to conform to the particular system(s) in which it will live, it is changed. The “dressmaker” methodology simply acknowledges this fact by separating the archive from the manifestation while also providing a template for a multiplicity of manifestations.

Future Steps

As of this writing there is no centralized database of parallel-flickr instances. Each application exists as an island unto itself, often archiving many of the same photos. Partly this is a design decision meant to honor the old adage “lots of copies keep stuff safe.” Partly it is a reflection of the fact that simply archiving a small set of a user’s photos presents enough challenges and complexities that a centralized registry is currently out of scope. If, however, such a service were created, it would be well within the scope of the parallel-flickr project to allow an instance to register itself globally.

Similarly, federated search of multiple instances of parallel-flickr is not yet possible. In theory it is absolutely possible given that parallel-flickr uses Solr for search indexing and Solr already has a robust and tested protocol for handling queries across multiple remote indexes. If implemented in concert with the above-imagined registry of parallel-flickr applications, it could provide not only a novel, but ultimately, practical way for a collection of institutions to archive a site the size of Flickr without forcing any single organization to bear the burden alone.
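Solr's distributed search is driven by its standard "shards" request parameter, which asks one node to fan a query out to a list of remote indexes and merge the results. A sketch of building such a query (the host names are hypothetical):

```python
from urllib.parse import urlencode

def federated_query_url(coordinator: str, shard_cores: list, query: str) -> str:
    """Build a distributed Solr query URL: the "shards" parameter
    tells the coordinating node which remote cores to fan the query
    out to before merging the results into one response."""
    params = {
        "q": query,
        "wt": "json",
        "shards": ",".join(shard_cores),
    }
    return f"http://{coordinator}/solr/select?{urlencode(params)}"
```

Each participating institution would list its own Solr core as one shard, so no single organization has to index, or even hold, the whole archive.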

It is important to note that the federated search scenario described does not in any way address the problem of distributed permissions. As a practical necessity, distributed search would have to limit itself to only those photos that are public to everyone.

Finally, although it may be tempting to imagine taking an application like parallel-flickr and creating an abstract framework with which many different kinds of site could be archived, it is not recommended. The purpose of parallel-flickr is not to provide a generic archiving interface for social websites but instead to provide a working demonstration of a larger conceptual approach towards the problem.

References

Archiving Flickr and Other Sites of Interest to Museums (the talk)

These are the slides for the talk that accompanied the paper Preserving Flickr and Other Sites of Interest to Museums, co-written with Ryan Donahue and presented at the Museums and the Web 2012 conference in San Diego. Unfortunately there were never any notes for the slides written out in long-form but, as mentioned, there's an entire paper on the subject if you want to know more.

(slide images)