JISC Beginner's Guide to Digital Preservation

…creating a pragmatic guide to digital preservation for those working on JISC projects

Update on the LOC Twitter Archive

Posted by Marieke Guy on 3rd June 2011

It’s all been very quiet on the Twitter front at the Library of Congress since their announcement last year, so it was good to see an update written by Audrey Watters for O’Reilly Radar. The article, entitled How the Library of Congress is building the Twitter archive, is a write-up by Audrey following a conversation with Martha Anderson, the head of the LOC’s National Digital Information Infrastructure and Preservation Program (NDIIPP), and Leslie Johnston, the manager of the NDIIPP’s Technical Architecture Initiatives. It gives us a little insight into how the LOC is dealing with the challenges and opportunities of archiving digital data of this kind.

The article cites the biggest challenges as the size of the archive (we are now producing 140 million tweets per day!), the composition of a tweet (a JSON file with a lot of Twitter metadata) and the layers of complexity (e.g. dealing with all the URLs that tweets link to).
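To give a feel for what "a JSON file with a lot of Twitter metadata" means in practice, here is a minimal Python sketch that parses one tweet-like record and pulls out the parts an archivist might care about. The sample record is invented for illustration and is heavily trimmed; its field names (`entities`, `urls`, `expanded_url` and so on) are assumptions about the general shape of a tweet rather than a statement of the exact format the LOC receives.

```python
import json

# A made-up, heavily trimmed tweet-like record (assumed shape, not real data).
SAMPLE_TWEET = """
{
  "id": 1,
  "created_at": "Fri Jun 03 10:00:00 +0000 2011",
  "text": "Update on the LOC Twitter archive http://example.org/a #digipres",
  "user": {"screen_name": "example_user"},
  "entities": {
    "urls": [{"url": "http://example.org/a",
              "expanded_url": "http://example.org/full-article"}],
    "hashtags": [{"text": "digipres"}]
  }
}
"""


def summarise_tweet(raw):
    """Parse one JSON record and keep the text, author, links and hashtags."""
    tweet = json.loads(raw)
    entities = tweet.get("entities", {})
    return {
        "id": tweet["id"],
        "author": tweet.get("user", {}).get("screen_name"),
        "text": tweet["text"],
        # Prefer the expanded link where one is present, else the short one.
        "links": [u.get("expanded_url") or u["url"]
                  for u in entities.get("urls", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
    }


summary = summarise_tweet(SAMPLE_TWEET)
print(summary["links"])
```

Even in this toy version the link problem shows up: the text of the tweet holds one address while the metadata holds another, and an archive has to decide which of them to keep and whether to follow them.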

Dealing with these complexities efficiently is a big job. As the article puts it:

This requires a significant technological undertaking on the part of the library in order to build the infrastructure necessary to handle inquiries, and specifically to handle the sorts of inquiries that researchers are clamoring for….Expectations also need to be set about exactly what the search parameters will be — this is a high-bandwidth, high-computing-power undertaking after all.

No decision has been made yet on which tools to use, but the library is “testing the following in various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and Hadoop”.

We wait with bated breath!

For those who like analogies, Martha Anderson has just written an interesting post on how saving digital information is a lot like jazz. In Digital Preservation Jazz, Martha talks about the creative, diverse, and collaborative nature of digital preservation.

Posted in Archiving | Comments Off

Treasuring Twitter

Posted by Marieke Guy on 6th September 2010

My article on Treasuring Twitter: The Why and How of Preserving Tweets has now been published in FUMSI.

FUMSI is the successor to FreePint and publishes articles aimed at helping information professionals do their work.

Posted in articles | Comments Off

Preserving your Tweets

Posted by Marieke Guy on 28th May 2010

Recently there has been lots of talk about preserving Tweets, especially since the Library of Congress agreed to take on the archive.

I’ve just written an article for FUMSI on this (probably out in September). FUMSI are the FreePint people who publish tips and articles to help information professionals do their work.

One area I looked at was the tools available to archive tweets. We wrote quite a lot about this on the JISC PoWR project. Here is a taster of the most current tools out there:

  • Print Your Twitter is a service that creates a PDF file of an account’s tweets.
  • The WordPress Lifestream plugin allows you to integrate Twitter with your blog and so archive tweets using the blog’s own capabilities.
  • What the Hashtag allows you to create an HTML archive and RSS feed based on a hashtag.
  • Tweetdoc is a service that allows you to create a PDF file that brings together all the tweets from a particular event or search term.
  • TwapperKeeper allows users to create a notebook of tweets for a hashtag.
  • The Archivist Desktop is a desktop application that runs on your local system and allows you to archive tweets for any given search for later data-mining and analysis.
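For anyone who would rather roll their own, the basic idea behind a hashtag archive can be sketched in a few lines of Python. This is only an illustration of the approach and not how any of the services above actually work: the function name `archive_hashtag` and the tweet fields `id` and `text` are hypothetical, and fetching the tweets in the first place is left out entirely.

```python
import io
import json
import re


def archive_hashtag(tweets, hashtag, out, seen_ids=None):
    """Write tweets mentioning #hashtag to `out`, one JSON record per line.

    `tweets` is any iterable of dicts with "id" and "text" keys (an assumed
    shape). Tweets whose id has already been archived are skipped.
    Returns the number of tweets written.
    """
    seen = set() if seen_ids is None else seen_ids
    # Match the whole tag only, ignoring case, so #jiscpowr2 is not a match.
    pattern = re.compile(r"(?<!\w)#" + re.escape(hashtag) + r"\b",
                         re.IGNORECASE)
    written = 0
    for tweet in tweets:
        if tweet["id"] in seen or not pattern.search(tweet["text"]):
            continue
        out.write(json.dumps(tweet, sort_keys=True) + "\n")
        seen.add(tweet["id"])
        written += 1
    return written


# Invented sample data: one duplicate and one unrelated tweet.
tweets = [
    {"id": 1, "text": "Notes from the #JISCPoWR workshop"},
    {"id": 2, "text": "Unrelated tweet"},
    {"id": 1, "text": "Notes from the #JISCPoWR workshop"},
    {"id": 3, "text": "More on #jiscpowr tools"},
]
buffer = io.StringIO()
count = archive_hashtag(tweets, "jiscpowr", buffer)
print(count, "tweets archived")
```

In real use `out` would be a file opened in append mode and `seen_ids` would be carried over between runs, so that repeating the same search does not fill the archive with duplicates.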

Other approaches include:

  • Searching Twemes, a site that mashes Twitter with Flickr, Delicious and other services.
  • Searching FriendFeed, which brings back much older tweets than Twitter does, although you are reliant on users being members of the service.
  • Subscribing to certain Twitter feeds by email and then applying an email filter to them.

Some of these tools are covered in more detail in the JISC PoWR blog post: Tools For Preserving Twitter Posts.

Anyone got any other suggestions?

Posted in Archiving, Web | Comments Off