this is aaronland

carbonite and brochureware

The following was originally published on the Museums and the Web 2012 conference website. It was co-written with Ryan Donahue, then of the George Eastman House. At the time this was written I was about to start a new role at the Cooper Hewitt and actively working on the Parallel Flickr project. I can't remember the details (thirteen years later as I write this) but by the time we finally finished this paper we actually disagreed quite strongly about what the best, or at least most realistic, approach would be. The slides for the talk that accompanied this paper follow the text of the paper itself.

Introduction

The digital turn, whether it be in photography, motion pictures, literature or other pursuits, has forever changed the ways and means by which museums collect, interpret and disseminate objects. Historically, museums have dealt primarily with objects that, in some way, are intuitively graspable from observation. Arrowheads are pointy, photographs are plainly seen, and writing read (with varying degrees of difficulty).

Digital information, on the other hand, is far more difficult to grasp intuitively, and in many cases impossible. This has led to a shift in approach to collection management: collect materials before they are scattered to the hard drives and floppy discs of history. With such a shift in approach, museums must become adept at preserving digital objects and ephemera in their original context: in many cases, a website. We examine one site of particular interest and complexity: Flickr.

Preserving Flickr has long been a popular subject of conversation among Flickr staff, alumni, and the museum community. Various tools and first steps have been taken, but no one to date has addressed the seemingly impossible task of preservation.

The authors of this paper are very familiar with its subject. Aaron Straup Cope is a Flickr alumnus, and Ryan Donahue is the Manager of Collection Information, Digital Assets and Web Development at George Eastman House, an early partner institution of the Flickr Commons. Through our knowledge and consultation with museum professionals, we shall address:

Journey Into Self

“The Internet is Disappearing...” – The Economist, Feb 5th 1994

“Though primarily concerned with Leyda's ‘squirming, seemingly formless larvae,’ Rick Prelinger has long been fascinated (and occasionally troubled) by the process by which this raw material is reworked into documentary cinema.” -– Leo Goldsmith, Recycling Programs

“History has always been lossy.” – Aaron Straup Cope, over drinks probably

Archiving web-based documents, whether in isolation or within the context of a larger parent website, has been a part of digital archival practice in many institutions for the better part of the last decade. Early archives were built manually, by saving HTML documents and related media (such as video or images), or with an archival web crawler such as Heritrix (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix). For sites of greater complexity, such as blogs and wikis, application-specific archival tools remain nascent. Those that do exist tend to export their data as XML or similar data serialization formats, and can be as elementary as direct database exports or as complex as hierarchical documents with multiple, sometimes overlapping, schemas.

Among the six hundred million plus websites on the internet (Netcraft, 2012), there is a distinct subset of large, dynamic and community-focused social websites (like Flickr, Facebook or Twitter, for example) which will mandate a dramatic shift in archival practice and thought, particularly when it comes to digital preservation. Given their scale and complexity, previously established methodologies in digital preservation are ill-suited for the task of creating meaningful archives of these types of websites; preserving a perfect mirror image, or even any sort of representative sample of these websites remains impractical given their size and dynamic nature.

The challenge, then, becomes how best to preserve these sites in order to provide for scholarly inquiry, create a historical record and enable curatorial interpretation without archiving the web application in whole.

Our paper addresses this problem using, but not limiting itself to, the photo-sharing website Flickr as a case-study (a sort of patient-zero). We describe the challenges faced and outline a practical approach for the preservation of the top-tier of large, dynamic, and social websites, and a set of best practices going forward. Not all archiving projects will face the kinds of challenges discussed here but it is hoped that by tackling an extreme case like Flickr we can hint at a way forward for other archiving projects, both large and small, in the future.

A Few Uncomfortable (and Inconvenient) Truths

For our intents and purposes, the things that museums are not presently able to do for large, dynamic websites include:

Methodologies of Archiving

The “Brochureware” method

The term “brochureware” is not used here as a pejorative, but rather reflects the static nature of the archive, bounded in both size and scope, and the number of objects in a collection. This methodology involves archiving a site using minimal documentary means: screenshots, descriptive narratives and a limited amount of static content from the site itself. It is the earliest, and most popular, methodology employed in contemporary practice, and is well-suited to small websites with little or no interactivity or dynamic content.

The archive ends up being a family album of sorts, containing snapshots of a website’s life. A unique benefit of this approach is that it is universally applicable across the web. This methodology is employed by the Internet Archive and many of the standard web archiving tools presently available. It is, however, generally impractical for archiving Flickr in its entirety: Flickr was designed as a series of “small tools for self-organizing” communities and as an exploration of photography as a social object and a medium. Neither of these is a concept that lends itself readily to being “snapshot-ed.”

The “Formaldehyde” method

Using this approach, an archive would systematically pre-render, or "bake", all the dynamic pages and possible interactions of a website and host the result as a read-only website. This assumes that viewers continue to use web browsers and that the archive in question operates a web server to serve those pages, but the HTML and HTTP standards are both well-documented and relatively easy to implement even assuming a time when the “Web” is no longer ubiquitous, or nearly so.
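A minimal sketch of what "baking" might look like, assuming a small Python harness (the function names and path-mapping scheme here are illustrative, not any particular tool's): each dynamic URL is fetched once and written to a static file that a plain web server can later re-serve.

```python
import pathlib
from urllib.parse import urlparse
from urllib.request import urlopen

def baked_path(url: str, out_root: str = "baked") -> pathlib.Path:
    """Map a dynamic URL to the static file that will stand in for it."""
    parsed = urlparse(url)
    # "/photos/12345/" becomes "baked/example.com/photos/12345.html"
    path = parsed.path.rstrip("/") or "/index"
    target = pathlib.Path(out_root) / parsed.netloc / path.lstrip("/")
    if not target.suffix:
        target = target.with_suffix(".html")
    return target

def bake_page(url: str, out_root: str = "baked") -> pathlib.Path:
    """Fetch one live page and write its HTML to the baked location."""
    target = baked_path(url, out_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:  # one fetch of the still-live page
        target.write_bytes(response.read())
    return target
```

Run over every reachable URL, the result is a directory tree any web server can host read-only; the combinatorial explosion of user-specific views is exactly what this sketch cannot capture, which is the point made below.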

Like the “brochureware” method, this is impractical for archiving a website like Flickr. The biggest problem with using this methodology with Flickr is the “singular perspective” problem. Sites like Flickr provide customized content for users based on age, country of origin and other user characteristics, along with restrictions on that content dictated by the relationships between a photographer and other Flickr users. Trying to preserve a website like Flickr using the “formaldehyde” method would manifest itself as an incomplete archive, particularly when archived from a controlled locale (for example, Germany and China each have government mandated restrictions on who can see specific content and when), or in sites where the social graph is used to filter content. Although theoretically possible with a site like Flickr, the number of combined pages (particularly search results) suggests that this may be an unmanageable and impractical approach. [2]

The “Carbonite” method

“In some locales, carbonite was also used to preserve the bodies of the dead… Han Solo was frozen in carbonite on Darth Vader's orders to test whether a Human could survive being encased in carbonite before his plan to freeze Luke Skywalker. The modifications to the freezing chamber were successful in that Han lived through the freezing process and entered into suspended animation.” – Carbonite - Wookieepedia, the Star Wars Wiki

Carbonite is a fictional metal alloy from the popular Star Wars movies which, when mixed with tibanna gas (also fictional) and compressed, can be frozen into blocks for transport. This methodology borrows its name from the practice of “freeze-drying” the entire stack used to create a website (databases and other servers, along with the configurations, operating systems, user accounts and all other files) into a software-based virtual machine which can later be reactivated for research or exhibition purposes. This can be thought of as akin to a research library with gated access to original materials. The piece itself could no longer be updated, but the work as it once existed can still be preserved.

The principal difficulty with this methodology is the efficacy of making an archive accessible in a read-only state. While Flickr itself was built around the idea of “feature flags”, which make it easy to disable specific pieces of functionality (disabling uploads without having to disable the entire site, for example), not all sites are designed in this way. This leads to the question of whether an archive should take it upon itself to modify a copy for exhibition and access. Even then, institutions would have to revisit the age-old question of whether modification would materially change the nature of the website being archived.
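A feature flag can be as simple as a global switch consulted at runtime. The sketch below is hypothetical (Flickr's actual implementation is not public), but it illustrates why a flag-based design makes freezing an archived copy read-only a matter of configuration rather than code surgery:

```python
# Hypothetical feature-flag sketch: functionality is gated by named
# switches, so one configuration change disables uploads without
# taking the rest of the site offline.
FLAGS = {
    "uploads_enabled": True,
    "comments_enabled": True,
}

def flag_enabled(name: str) -> bool:
    # unknown flags default to off, which fails safe
    return FLAGS.get(name, False)

def handle_upload(username: str) -> str:
    if not flag_enabled("uploads_enabled"):
        return "uploads are temporarily disabled"
    return f"photo stored for {username}"

# "freezing" an archived copy becomes a single configuration change:
FLAGS["uploads_enabled"] = False
```

A site built without such gates offers no equivalent lever, which is why an archive may be forced to modify the copy itself.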

The “Carbonite” method is often the approach adopted by design studios who struggle with how to preserve their work when a client stops supporting it or the software used to view it is no longer supported.

A concrete example of this problem and approach is the Flickr “Explore Clock” designed and built for Flickr by Stamen Design in 2009. The clock was developed as part of the introduction of video as a supported medium on Flickr. Users were asked to add a special kind of “machine” tag to their videos indicating what time the video was taken and then submit their work to a moderated group for approval. (Unlike digital photos there is no single metadata standard that video cameras follow and so relatively simple things like the date and time a video was shot are absent from the files themselves.) The videos were then displayed on an interactive Flash-based timeline, developed by Stamen, that retrieved records using the Flickr API.

In January 2012 Flickr announced that it would remove the Explore Clock from the site because “its user experience is complicated and it’s not core to our product offering” (http://blog.flickr.net/en/2012/01/13/start-the-new-year-fresh/). How then should Stamen, or a cultural heritage institution, preserve this work? The most effective means would be to archive the results of the Flickr API calls themselves (typically XML files sent over HTTP), for the dates that the Clock was operational, as static files on another server and then update the Flash application to request and retrieve its data from this updated endpoint. Whether the application was modified to request the actual files or call another program that mimics the Flickr API and processes those files is outside the scope of this discussion. The salient point is that the work in question has few enough moving pieces (it is a self-contained application that interacts with Flickr using a set of bounded criteria that can itself be stored as static files) that the “carbonite” method is applicable.
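The archive-the-API-responses strategy can be sketched as follows, with hypothetical helper names: each method-plus-parameters combination maps to a deterministic file name, so the frozen application (or a small shim mimicking the live API) can look up the stored response instead of calling the real service.

```python
import json
import pathlib

def response_path(root: str, method: str, params: dict) -> pathlib.Path:
    """Derive a deterministic file name from an API method and its
    parameters, so the same call always maps to the same flat file."""
    key = "_".join([method] + [f"{k}-{params[k]}" for k in sorted(params)])
    return pathlib.Path(root) / f"{key}.json"

def save_response(root: str, method: str, params: dict, payload: dict) -> None:
    """Archive one API response as a static JSON file."""
    target = response_path(root, method, params)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))

def load_response(root: str, method: str, params: dict) -> dict:
    """What the re-pointed client does instead of calling the live API."""
    return json.loads(response_path(root, method, params).read_text())
```

Because the Clock only ever issued a bounded set of calls, enumerating and saving them all is tractable in a way it would not be for the site as a whole.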

The “Dressmaker” method

Using this approach the raw component parts of a site (databases, user accounts, etc.) are stored but no attempt is made to reactivate the site itself, and no attempt to create a single canonical index is made. It is assumed that the archive itself is treated in the same way that a dressmaker might use (and reuse and combine) a single pattern with different fabrics.

It shares characteristics with the “formaldehyde” method in that it presumes that as much raw data as possible will be preserved, preferably as plain-text documents, but it does not assume that, on their own, these documents will constitute any real or meaningful representation of the thing being archived.

The archive itself is not an artifact, nor can it be for all the reasons discussed above. Rather, it is an attempt to preserve things in a way that forces the archive itself to focus on the process in which a website was created and thrived rather than on any single representation or artifact.

By virtue of the way (the "how") that these sites would be archived, putting together an interpretation (curating) is not simply a question of arranging things (artifacts) but of creating many possible new things, or derivatives, that sample the whole. The interpretation of the archive manifests itself as something like, but not the same as, the thing being preserved.

Of the four methodologies described, this is the best suited to a website as large and far-reaching as Flickr.

Anti-Museums

“Ephemeral films were generally produced to fulfill specific objectives at specific times, and most often were not considered to be of value afterwards. In retrospect, they provide unparalleled evidence of the visual appearance and ambiance of their time, and function as rich, evocative, and often entertaining documentation of the American past. … Frequently offering more than just evidence, ephemeral films document past persuasions and anxieties.  They show us not only how we were, but how we were supposed to be." – Prelinger Archives, Collection Summary

Flickr, by its very nature, has always been a kind of anti-museum. This was always its strength, and the breadth of photos always outweighed the inevitable tidal wave of potentially mediocre submissions that comes from letting "anyone" participate. By extension, it becomes essentially impossible to assign any sort of curatorial guidance, or policy, when it comes to archiving. Insofar as a policy exists, the short version would be: Yes!

The long version is: Yes, so long as it, and all the associated metadata, has been or is subsequently blessed as "public" by the photo owner. This is more of a practical matter than a philosophical stance, since managing permissions rapidly becomes a significant technical and resource challenge, not to mention an ethical one, that falls outside the remit of cultural heritage institutions.

Archiving a website like Flickr is a massive challenge for any one institution to undertake. A more practical approach might be for an institution to:

There is an important point to make about community contributions: Not all contributions can, or should, be assigned the same weight or value; however, institutions should note that trying to create an archive of a community-based website without also allowing the same community to participate runs the risk of antagonizing the users, fostering suspicion and potentially encouraging competing archival copies. Institutions need to be aware of this new form of cultural sensitivity when dealing with community-driven websites, and it’s in an institution’s best interest to not alienate said community.

Museums and libraries and archives are no longer the only institutions capable of collecting, housing and organizing cultural heritage. What the Internet has demonstrated is that it is possible for communities of interest to self-organize around a topic and in a relatively short span of time produce bodies of work that sometimes rival traditional scholars in their depth and usually exceed them in their breadth.

That is not to say, however, that museums, libraries and archives should abstain from collecting materials. As history has proven repeatedly, any archival system dependent upon external entities making sound archiving decisions is fundamentally flawed. What this finding does encourage is cooperation between institutions and the users (in Flickr’s case, the small self-organizing communities) that constitute the lifeblood of socially-driven communities, particularly in the area of cataloging. Using primary sources isn’t necessarily news, but in the context of socially-driven websites, it can be overlooked.

Cataloging is the place where museums and the larger public and communities of enthusiasts are meeting and being forced to find common ground. The opportunity facing museums, and by extension museum studies, today is how to use and shape participation in that cataloging process: to imagine museums not simply as archives of a considered past but also as zones of safekeeping for future considerations.

parallel-flickr

The application “parallel-flickr” is a useful proof-of-concept implementation of some of these ideas. It is designed to archive one or more users’ Flickr photos and to generate a database-backed website that honors the viewing permissions those users have chosen for their photos on Flickr. Additionally, photos that a user has “favorited” on Flickr are also archived. This allows each individual instance of parallel-flickr to be slightly rough around the edges, because it will contain truncated copies of other users’ photo streams: a reflection of the way users have experienced Flickr and other Flickr users.

The project was started in 2010 and provides a tangible example of some of the issues and suggestions addressed in this paper: something to “click” on to see what does and does not work and to provide a concrete framework around which discussion can continue. It uses the “dressmaker” approach, storing photos and metadata files separately from the application itself and using a series of tools to harvest and index that data into a standalone web application. It remains in active development, which means that not all the issues inherent in archiving a site like Flickr, even for a small and tightly controlled network of users, have been addressed or solved.

parallel-flickr is an open source project built on the standard “LAMP” stack (Linux, Apache, MySQL and PHP), the Solr document indexer, and a series of command-line tools for fetching and indexing the flat files described above. Minus the optional Solr requirement, parallel-flickr has been shown to run on ‘plain vanilla’ shared-hosting providers. If Solr is configured, it provides the following functionality:

parallel-flickr archives the following data:

Except for the original photos, these are not necessarily the items displayed to visitors; rather, they are the raw "ingredients" that exist independent of any institution’s specific cataloging system or curator's interpretation.

A single photo with one comment would then demand that the following files (or objects) be archived:

# the original photo
/photos/234/676/884/343/234676884343_s4gfdJX54.jpg

# basic metadata about the photo
/photos/234/676/884/343/234676884343_s4gfdJX54_i.json

# comments on the photo
/photos/234/676/884/343/234676884343_s4gfdJX54_c.json

# basic metadata about the photo owner and the photo commenter
/users/663/538/253/663538253@N00.json
/users/335/274/3/3352743@N01.json

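The directory scheme in the listing above can be expressed as a small function. This is a sketch consistent with the example paths, not parallel-flickr's actual code: the numeric photo ID is split into three-digit groups so that no single directory accumulates millions of files.

```python
def photo_dir(photo_id: str) -> str:
    """Split the numeric photo ID into three-digit groups to build
    the nested directory tree shown in the listing above."""
    groups = [photo_id[i:i + 3] for i in range(0, len(photo_id), 3)]
    return "/photos/" + "/".join(groups)

def photo_path(photo_id: str, secret: str, suffix: str = ".jpg") -> str:
    """Full path for the photo itself, or (with a different suffix)
    for one of its flat-file metadata companions."""
    return f"{photo_dir(photo_id)}/{photo_id}_{secret}{suffix}"
```

The same ID always maps to the same path, so harvesting tools and indexers can locate a photo's files without consulting a database first.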
Flat files have a number of advantages as a primary archive:

Technically, there is nothing special or required about the JSON format. Its main advantages, as of this writing, are that it is a plain-text format which can be easily converted; that it assumes the UTF-8 character encoding by default; that it already enjoys wide and popular support in most programming languages, particularly JavaScript; and that it would allow an archive to expose IDs and URLs on the web and make them accessible to third-party clients using JavaScript and the Cross-Origin Resource Sharing (CORS) standard.
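As an illustration of that last point, a minimal server (Python's standard library; the class and helper names are ours) could expose the archived flat files with the CORS header that browsers require before permitting cross-origin reads:

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class CORSHandler(SimpleHTTPRequestHandler):
    """Static-file handler that adds the CORS header allowing any
    origin, so browser-based third-party clients can read the JSON."""
    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

def serve_archive(directory: str, port: int = 0) -> ThreadingHTTPServer:
    """Bind a server over the archive directory; port 0 picks a free
    port. Call serve_forever() (e.g. in a background thread) to run."""
    handler = partial(CORSHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Any JavaScript client on any domain could then fetch the archive's JSON directly, which is precisely the kind of third-party reuse the flat-file approach makes cheap.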

Likewise, assuming available disk space, there is nothing in the code base that would prevent an institution from modifying the application so that it will also generate metadata serialization in other formats, most notably XML, which still enjoys widespread adoption and use among cultural heritage organizations.

What has been described so far is not an indexing system or cataloguing methodology. These pieces are highly specific to the collecting institution, and are best suited to conform to institutional standards for cataloging. Decoupling the archival elements from the cataloging record enforces a separation of concerns that should permit more, and faster, iteration.

Cataloging born-digital elements does represent a significant shift in the availability of cataloging data (Johnston, 2010), particularly descriptive metadata. With Flickr, some pieces of data map nicely (titles, descriptions and keywords, in particular), while others do not map cleanly. Is a photo considered authored by the uploading user? Are they the attributed author? While Flickr’s terms of service may require account holders to upload only their own materials (except for the institutions in the Flickr Commons, that is), many users upload content that is not their own. Once the hard mapping decisions are made, however, cataloging can be a mostly automatic process.
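As an illustration, a first-pass automatic mapping might look like the sketch below. The catalog field names are invented for the example, not any institution's schema, and authorship is deliberately flagged for human review rather than asserted, for the reasons just described.

```python
def to_catalog_record(photo: dict) -> dict:
    """Map Flickr-style photo metadata onto a flat catalog record.
    Titles, descriptions and keywords map directly; authorship does
    not, since the uploader is not reliably the photographer."""
    return {
        "title": photo.get("title", ""),
        "description": photo.get("description", ""),
        "keywords": [tag["raw"] for tag in photo.get("tags", [])],
        "candidate_author": photo.get("owner", ""),  # needs human review
    }
```

The easy fields run unattended; the flagged field is where institutional judgment re-enters the process.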

Here's a layer cake of the whole stack, with the various pieces we're talking about identified:

The “dressmaker” approach presumes that an archive will be given form inside the larger context of a museum’s cataloging system. To assume otherwise places an unrealistic burden on institutions by asking them to replace the large and complex systems already in place to manage their collections. On the other hand, the dressmaker approach does presume that an archive consists of at least three distinct components:

Why is this approach better? It is more flexible, and both easier and less expensive to adapt than the alternatives. It is ultimately more reliable, despite the upfront cost of “processing” files. Every programming environment can process text files, while specialized databases are a software and hardware dependency nightmare waiting to happen.

Finally it lays bare the fact that by forcing the data to conform to the particular system(s) in which it will live, it is changed. The “dressmaker” methodology simply acknowledges this fact by separating the archive from the manifestation while also providing a template for a multiplicity of manifestations.

Future Steps

As of this writing there is no centralized database of parallel-flickr instances. Each application exists as an island unto itself, often archiving many of the same photos. Partly this is a design decision meant to honor the old adage “lots of copies keep stuff safe.” Partly it is a reflection of the fact that simply archiving a small set of a user’s photos presents enough challenges and complexities that a centralized registry is currently out of scope. If, however, such a service were created, it would be well within the scope of the parallel-flickr project to allow an instance to register itself globally.

Similarly, federated search of multiple instances of parallel-flickr is not yet possible. In theory it is absolutely possible given that parallel-flickr uses Solr for search indexing and Solr already has a robust and tested protocol for handling queries across multiple remote indexes. If implemented in concert with the above-imagined registry of parallel-flickr applications, it could provide not only a novel, but ultimately, practical way for a collection of institutions to archive a site the size of Flickr without forcing any single organization to bear the burden alone.
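Solr's distributed search is driven by its standard "shards" request parameter, which asks one node to fan a query out to a list of remote indexes and merge the results. A sketch of building such a query (the host names are hypothetical):

```python
from urllib.parse import urlencode

def federated_query_url(coordinator: str, shard_cores: list, query: str) -> str:
    """Build a distributed Solr query URL: the "shards" parameter
    tells the coordinating node which remote cores to fan the query
    out to before merging the results into one response."""
    params = {
        "q": query,
        "wt": "json",
        "shards": ",".join(shard_cores),
    }
    return f"http://{coordinator}/solr/select?{urlencode(params)}"
```

Each participating institution would list its own Solr core as one shard, so no single organization has to index, or even hold, the whole archive.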

It is important to note that the federated search scenario described does not in any way address the problem of distributed permissions. As a practical necessity, distributed search would have to limit itself to only those photos that are public to everyone.

Finally, although it may be tempting to imagine taking an application like parallel-flickr and creating an abstract framework with which many different kinds of site could be archived, it is not recommended. The purpose of parallel-flickr is not to provide a generic archiving interface for social websites but instead to provide a working demonstration of a larger conceptual approach towards the problem.

References

Archiving Flickr and Other Sites of Interest to Museums (the talk)

These are the slides for the talk that accompanied the paper Preserving Flickr and Other Sites of Interest to Museums, co-written with Ryan Donahue and presented at the Museums and the Web 2012 conference in San Diego. Unfortunately there were never any notes for the slides written out in long-form but, as mentioned, there's an entire paper on the subject if you want to know more.

(slide images)