Update on the LOC Twitter Archive

Marieke Guy, Fri, 03 Jun 2011. JISC Beginner's Guide to Digital Preservation (http://blogs.ukoln.ac.uk/jisc-bgdp)

It’s all been very quiet on the Twitter front at the Library of Congress since their announcement last year, so it was good to see an update written by Audrey Watters for O’Reilly Radar. The article, entitled “How the Library of Congress is building the Twitter archive”, is a write-up by Audrey following a conversation with Martha Anderson, the head of the LOC’s National Digital Information Infrastructure and Preservation Program (NDIIPP), and Leslie Johnston, the manager of the NDIIPP’s Technical Architecture Initiatives. It gives us a little insight into how the LOC is dealing with the challenges and opportunities of archiving digital data of this kind.

The article cites the biggest challenges as the size of the archive (we are now producing 140 million tweets per day!), the composition of a tweet (a JSON file with a lot of Twitter metadata) and the layers of complexity (e.g. dealing with all the URLs that tweets link to).
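To give a feel for what "a JSON file with a lot of Twitter metadata" means in practice, here is a minimal Python sketch that reads one tweet-like record and pulls out its links. The sample record is invented for illustration, and the field names (`entities`, `urls`, `expanded_url`) are an assumption modelled on Twitter's public API format, not anything the LOC has described; real records carry far more metadata.

```python
import json

# Hypothetical, heavily simplified tweet record (invented for illustration).
SAMPLE_TWEET = """
{
  "id_str": "1",
  "created_at": "Fri Jun 03 11:43:22 +0000 2011",
  "text": "Update on the LOC Twitter archive http://t.co/example",
  "user": {"screen_name": "example_user"},
  "entities": {
    "urls": [
      {"url": "http://t.co/example",
       "expanded_url": "http://example.org/loc-twitter-archive"}
    ]
  }
}
"""


def extract_links(raw_json):
    """Return the URLs referenced by one tweet record.

    Prefers the expanded form of a shortened link when the record has one.
    """
    tweet = json.loads(raw_json)
    links = []
    # Records without link metadata simply yield an empty list.
    for entry in tweet.get("entities", {}).get("urls", []):
        links.append(entry.get("expanded_url") or entry["url"])
    return links


if __name__ == "__main__":
    print(extract_links(SAMPLE_TWEET))
```

Even in this toy form the link handling hints at the "layers of complexity": a shortened link is only useful to a future researcher if the address it pointed to is kept alongside it.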

Dealing with these complexities efficiently is a big job. As the article puts it:

This requires a significant technological undertaking on the part of the library in order to build the infrastructure necessary to handle inquiries, and specifically to handle the sorts of inquiries that researchers are clamoring for….Expectations also need to be set about exactly what the search parameters will be — this is a high-bandwidth, high-computing-power undertaking after all.

No decision has been made yet on which tools to use, but the library is “testing the following in various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and Hadoop”.

We wait with bated breath!

For those who like analogies, Martha Anderson has just written an interesting post on how saving digital information is a lot like jazz. In “Digital Preservation Jazz”, Martha talks about the creative, diverse, and collaborative nature of digital preservation.
