The article cites the biggest challenges as the sheer size of the archive (Twitter users now produce 140 million tweets per day!), the composition of a tweet (a JSON object carrying a large amount of Twitter metadata) and the layers of complexity involved (e.g. dealing with all the shortened URL links).
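To make the "composition of a tweet" point concrete, here is a minimal sketch of what handling that JSON looks like. The field names (`id_str`, `text`, `entities.urls`, `expanded_url`) follow Twitter's widely documented v1 tweet format, but the sample tweet itself is invented for illustration:

```python
import json

# A cut-down, invented example of a tweet object in the style of
# Twitter's v1 API JSON; real tweets carry far more metadata.
raw_tweet = """
{
  "id_str": "123456789",
  "text": "Archiving the web: http://t.co/abc123",
  "user": {"screen_name": "example_user"},
  "entities": {
    "urls": [
      {"url": "http://t.co/abc123",
       "expanded_url": "http://www.example.org/archiving"}
    ]
  }
}
"""

tweet = json.loads(raw_tweet)

# Shortened t.co links hide the real destination; the entities block
# (when present) records the expanded form alongside each short URL.
expanded = [u.get("expanded_url") or u["url"]
            for u in tweet.get("entities", {}).get("urls", [])]

print(expanded)  # ['http://www.example.org/archiving']
```

Even this toy example shows why URL links add a "layer of complexity": an archive that stores only the `t.co` short links loses the actual destinations unless it also captures or resolves the expanded URLs.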
Dealing with these complexities efficiently is a major piece of work.
This requires a significant technological undertaking on the part of the library in order to build the infrastructure necessary to handle inquiries, and specifically to handle the sorts of inquiries that researchers are clamoring for…. Expectations also need to be set about exactly what the search parameters will be — this is a high-bandwidth, high-computing-power undertaking after all.
No decision has been made yet on which tools to use but the library is “testing the following in various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and Hadoop”.
We wait with bated breath!
For those who like analogies, Martha Anderson has just written an interesting post on how saving digital information is a lot like jazz. In Digital Preservation Jazz, Martha talks about the creative, diverse and collaborative nature of digital preservation.
FUMSI is the successor to FreePint and publishes articles aimed at helping information professionals do their work.
I’ve just written an article for FUMSI on this (probably out in September). FUMSI are the FreePint people who publish tips and articles to help information professionals do their work.
One area I looked at was the tools available to archive tweets. We wrote quite a lot about this on the JISC PoWR project. Here is a taster of the most current tools out there:
Other approaches include:
Some of these tools are covered in more detail in the JISC PoWR blog post: Tools For Preserving Twitter Posts.
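A common denominator across many of these tools is that they store each captured tweet as a line of JSON. As a rough sketch of that pattern (not any particular tool's implementation; the file name and tweet fields here are invented), deduplicated append-only archiving could look like:

```python
import json
import os

ARCHIVE = "tweet_archive.jsonl"  # hypothetical archive file name

def archive_tweets(tweets, path=ARCHIVE):
    """Append tweets to a line-delimited JSON file, skipping any
    tweet id already present. Returns the number written."""
    seen = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                seen.add(json.loads(line)["id_str"])
    written = 0
    with open(path, "a") as f:
        for t in tweets:
            if t["id_str"] not in seen:
                f.write(json.dumps(t) + "\n")
                seen.add(t["id_str"])
                written += 1
    return written

# Start from a clean slate for this demo.
if os.path.exists(ARCHIVE):
    os.remove(ARCHIVE)

# Invented sample data for demonstration.
batch = [
    {"id_str": "1", "text": "first tweet"},
    {"id_str": "2", "text": "second tweet"},
]
print(archive_tweets(batch))  # 2: both tweets are new
print(archive_tweets(batch))  # 0: both already archived
```

The deduplication step matters in practice: repeated Twitter searches for the same term return overlapping results, so naive appending would store the same tweet many times over.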
Anyone got any other suggestions?