See for future updates and for a searchable version of this blog.
It remains only for me to thank the many talented students who have worked here over the years whilst studying for undergraduate degrees, masters’ degrees or PhDs. It has been a pleasure and an honour.
“MTP is a big improvement over USB mass storage – for devices with lots of internal memory, a manufacturer no longer needs to come up with some hard partition between the USB mass storage and internal storage. Instead, they are all in one partition, with MTP providing access to the directory of media files that would normally be available through USB mass storage. This means there is no longer a need for apps on SD card for such devices, because what used to be the ‘internal SD card’ is in the same partition as where applications are stored. The storage on your device can be used for either applications or media, depending on what you want to put on it. You aren’t stuck with how much space the manufacturer decided to leave for the two areas.[...Also,] this means that the media storage doesn’t need to be unmounted from Android when it is being accessed through the PC”
The trouble is that MTP on Linux can be a little problematic. For example, the Nook I bought recently doesn’t work out of the box on my Slackware-current install despite having the latest libmtp: the device turned out to be read-only. Initially I had to use ADB to copy files onto it, but we found a simple solution. If your MTP devices turn out to be read-only on Linux, then there’s a strong possibility that the problem is with libmtp. Specifically, the devices may need to be listed in the file music-players.h. For this you need the vendor ID and the device ID. To get these, plug the device in via USB and type ‘lsusb’. Find the relevant line:
Bus 001 Device 036: ID 2080:0006 Barnes & Noble
The first of those (2080) is the vendor ID. The second is the device ID. Adding these to libmtp and recompiling resolved the Nook HD issue for me; maybe it will work for you too. Pretty soon libmtp will contain these devices as standard (the HD+ is already in there), but then there’ll be another crop of Android devices with similar MTP problems around in a minute, so it’s worth remembering the trick anyway.
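If you have a few devices to check, picking the two IDs out by eye gets tedious. Here is a minimal sketch in Python that extracts them from a line of lsusb output. The function name is my own invention for illustration, and it assumes the usual "ID vvvv:pppp" layout shown above.

```python
import re

# Matches the "ID vvvv:pppp" part of an lsusb line (four hex digits each).
LSUSB_ID = re.compile(r"\bID\s+([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\b")

def parse_lsusb_line(line):
    """Return (vendor_id, device_id) as lowercase hex strings, or None."""
    match = LSUSB_ID.search(line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()

line = "Bus 001 Device 036: ID 2080:0006 Barnes & Noble"
vendor_id, device_id = parse_lsusb_line(line)
print("vendor: 0x%s, device: 0x%s" % (vendor_id, device_id))
```

These are the two values that would go into the new entry in music-players.h.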
I like the idea of it, not so much for development work but for data analysis and worked demonstration. For something to share it certainly beats a command-line session.
It looks rather like this:
I encountered it during an e-Humanities workshop yesterday and I’m already itching to play with it again.
Getting it to install on Slackware involved a few steps. There are probably better ways, but here’s how I did it, based more or less on
Step 0: optional. Open sbopkg as root. Search for and install python3. Version 2.6 works with the ipython-notebook but 3 likes sets and UTF8 better.
Step 1: curl -O
Step 2: sudo python
Step 3: easy_install pip
Step 4: Now that you have pip, you can follow the latter part of the blog post above:
Open sbopkg as root. Search for and install zeromq.
For the moment, install libpgm-5.1.116~dfsg from source (tar xvfz libpgm-5.1.116~dfsg.tar.gz; ./configure; make; make install).
pip install ipython
pip install pyzmq
pip install tornado
pip install --upgrade ipython
pip install numpy
pip install matplotlib
Follow the configuration instructions described in the blog post.
PIMMS is a software stack designed to support the development and collection of metadata. Initially, it was developed to “document the provenance of computer simulations of real world earth system processes”[*], but as is traditional for almost any infrastructure designed to support different types of experiment, the thought has occurred that the approach may be more broadly generalisable. It’s designed to allow for direct user (=scientist) involvement and the platform consists of, essentially, a sequence of stages of form-filling and design processes, each of which fulfils a purpose:
Unusually and rather charmingly, PIMMS uses mindmapping as a design tool, viewing it as accessible to users. Whilst PIMMS clearly contains elements of the thinking that underlies UML-based design and uses UML vocabulary and tools in places, UML is ‘useful within a development team’, says Charlotte Pascoe, the PIMMS project manager, but it is not meant for end-users.
PIMMS counts among its potential benefits an increase in the sheer quantity, quality and consistency of metadata provided. The underlying methods and processes can, in theory at least, also be generalised. A mindmap could be built for any domain, parsed into a formal data structure, automagically built or compiled into a web form and applied to your metadata. The process for building a PIMMS form goes more or less as follows.
If this sounds somewhat familiar, folks, it is because the concepts underlying PIMMS have a long and honourable background in software engineering. Check out the Web Semantics Design Method (De Troyer et al, 2008), which specifies the following process for engineering a web application – my own comments in parentheses to the right:
WSDM, as described here, owes much to the waterfall model of software engineering (although, one would assume, there is nothing stopping subsequent iteration through phases) – see for example Ochoa et al (2006). To my eyes, the PIMMS metadata development process would appear to implement about half of WSDM in a less analytical and more user-centric model, encouraging direct input from the scientists likely to use the data.
The distinction, primarily, is in the implementation design and implementation phase; the PIMMS infrastructure compiles your conceptual design/structure, as represented in the mind map you have developed, into an XML structure from which PIMMS can build user-facing forms. After that compilation phase, further implementation work is essentially cosmetic, presentational work such as skinning the form. PIMMS removes the majority of implementation decisions from the user by making them in advance. Much as SurveyMonkey consciously limits the user’s design vocabulary to elements that may be useful for your average survey, PIMMS essentially offers a constrained vocabulary of information types and widgets.
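To make the compilation step a little more concrete, here is a toy sketch in Python of turning a mind map into an XML structure that a form builder could consume. To be clear, this is not PIMMS code and not the PIMMS XML format: the mind map, the element names (questionnaire, group, question) and the widget types are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mind map: each branch is a dict, each leaf names a widget type.
mind_map = {
    "simulation": {
        "name": "text",
        "start_date": "date",
        "model": {"version": "text", "resolution": "number"},
    }
}

def compile_node(label, value):
    """Turn one mind-map node into an XML element (illustrative tag names)."""
    if isinstance(value, dict):
        element = ET.Element("group", {"label": label})
        for child_label, child_value in value.items():
            element.append(compile_node(child_label, child_value))
        return element
    return ET.Element("question", {"label": label, "type": value})

def compile_mind_map(tree):
    """Wrap every top-level branch of the mind map in a single root element."""
    root = ET.Element("questionnaire")
    for label, value in tree.items():
        root.append(compile_node(label, value))
    return root

form = compile_mind_map(mind_map)
print(ET.tostring(form, encoding="unicode"))
```

The point of the sketch is the constrained vocabulary: the designer only chooses labels and one of a few widget types, and everything else is decided in advance.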
I don’t make the comparison between PIMMS and SurveyMonkey lightly. The PIMMS project itself uses the terminology of ‘questionnaires’. PIMMS-based forms have a lot in common with SurveyMonkey, too; incrementally developing the form whilst still retaining your previously collected data is not a straightforward operation. That may be a good thing – that way, you know which version of the input questionnaire your data came from – but on the other hand, incremental tinkering can sometimes be a useful design approach too…
The day continues. The sun subsides and the room is cooling fast. The geographers in the room, climate modellers of anything from the Jurassic to the Quaternary, go through a worked example of developing a PIMMS questionnaire. They discover a minor problem: the dates in the PIMMS forms don’t represent the usage of dates in palaeoclimate research, which are measured in ‘ka’ – thousands of years. This is a problem inherited from the UM, the Met. Office Unified Model [numerical modelling system].
Faster than you could say ‘Precambrian’, we are out of time. There has not been a chance to view the generated metadata collection form in use, which I regret slightly as it is the most common scenario in which the attendees will work with the system. Still, it was a worthwhile day. Workshop attendees have voiced an interest in working with the software in future. As for me, after this glimpse into the future of palaeoclimate data management, I find myself thinking back to my past life in web engineering. I wonder whether, like palaeoclimatologists, research data managers could develop their expectations of the future by exploring the literature of times past…
De Troyer, O., Casteleyn, S. and Plessers, P. (2008). WSDM: Web Semantics Design Method. In: Rossi et al (eds.), Web Engineering: Modeling and Implementing Web Applications.
Ochoa, S. F., Pino, J. A., Guerrero, L. A. and Collazos, C. A. (2006). SSP: A Simple Software Process for Small-Size Software Development Projects. IFIP Workshop on Advanced Software Engineering 2006: 94-107.
While unfortunately the hardware of the Raspberry Pi is almost unchangeable (apart from the size of the SD card used), this is more than made up for by the choice of operating systems. In true hacking fashion, several operating systems have sprung up, each doing different things. Here is a selection:
1) Raspbian “Wheezy”
Raspbian is based on Debian, and is the recommended starting point for beginners to the Raspberry Pi. It boots to a command prompt by default, but LXDE, a lightweight X11 desktop environment, comes pre-installed. Other included tools are the Midori web browser and all the development tools you’d expect on a Linux system, including Python and Java. Of course, since it’s a Debian-based installation, new software is a doddle to install using the package manager. Within minutes I had set up VLC and was playing 1080p video with no problems.
2) Arch Linux ARM
Arch Linux is extremely popular with the modders and tweakers of the Raspberry Pi community. Its no-frills approach centres on “simplicity and full control for the end-user”. By default, no X11 server is included – it is up to the user to decide which (if any) they would like. Obviously, this distribution is not recommended for those with little to no Linux knowledge.
3) RaspBMC
On the other end of the scale, RaspBMC is totally different to either of the distributions mentioned above. When you use this distribution to boot the Raspberry Pi, it becomes a fully-fledged home media centre, with the ability to play films, music and even YouTube videos. RaspBMC is based on the very popular XBMC, a cross-platform media centre that is used by countless people worldwide.
One of the main reasons that the Raspberry Pi came about was to teach children in schools about electronics and programming. As such, the GPIO pins can be driven from code and used to give sensor readings to programs. Unfortunately, in Raspbian at least, the Python modules for interacting with the GPIO pins are not included by default. Instructions for installing them are given here. A popular way to interface with the Raspberry Pi is a simple ribbon cable and a prototyping board, which will let you try out many different combinations before settling on something more permanent. One of the peripherals that has generated the most buzz lately is a camera module, featured here, which would pave the way to features such as image recognition for navigation, or more multimedia capabilities.
As with most things, however, there are a few drawbacks, but what else did you expect from a machine costing £25/$35? The biggest caveat for me was initially the lack of hardware MPEG-2 decoding, which meant my whole library of movies would have to be transcoded to H.264 for smooth playback on the device. However, the Raspberry Pi Foundation has now released licences for roughly £2.50 for MPEG-2 and £1.50 for VC-1. The other gripe that some may have is the lack of expandable RAM, as it is all packaged together with the CPU. Such users may find the VIA APC or Cubieboard a little more suitable for their use; however, for pure value for money and form factor, the Raspberry Pi is hard to beat.
Edit (1/11/12) – As of October 15th, the Raspberry Pi now ships with 512MB RAM, making it an even more attractive proposition for its price point.
1) Streaming video to an icecast server.
Once you have the icecast server set up this is actually shockingly easy to do. Set up a mountpoint on the server side, in your icecast.xml setup (/usr/local/etc/icecast.xml by default):
for example.
Now, on the client side (which could be anything from Windows to Linux to MacOS, because VLC is cross-platform, but this example is Windows), try
C:\Users\Em>"c:\Program Files (x86)\VideoLAN\VLC\vlc.exe" "C:\Users\Public\Videos\My Video.wmv" --sout=#transcode{vcodec=theo,vb=800,scale=1,acodec=vorb,ab=128,channels=2,samplerate=44100}:std{access=shout,mux=ogg,}
It should transcode on the fly into Ogg Vorbis/Theora and throw it at your icecast server. Viewers who go to the mountpoint’s URL should be able to view it from there. Note that you can change various settings on the transcode process (for example scale=0.5, vb=400) to reduce the network bandwidth required. Paradoxically, though, reducing some of these settings will actually increase the time taken by the transcoding process, so the transcode can end up laggier than it was already.
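If you find yourself fiddling with these settings a lot, it can help to generate the --sout string rather than retype it. The following is only a sketch: the function name is mine, the destination is a placeholder (the source password, host, port and mountpoint are invented, not taken from any real setup), and I am assuming the usual dst=user:password@host:port/mountpoint form for the shout output.

```python
def build_sout(dst, vcodec="theo", vb=800, scale=1, acodec="vorb",
               ab=128, channels=2, samplerate=44100):
    """Assemble a VLC --sout chain that transcodes to Ogg and sends it to icecast."""
    transcode = (
        "vcodec=%s,vb=%d,scale=%s,acodec=%s,ab=%d,channels=%d,samplerate=%d"
        % (vcodec, vb, scale, acodec, ab, channels, samplerate)
    )
    return "#transcode{%s}:std{access=shout,mux=ogg,dst=%s}" % (transcode, dst)

# Placeholder destination: substitute your own password, host, port and mountpoint.
DST = "source:hackme@icecast.example.org:8000/test.ogg"

print("--sout=" + build_sout(DST))
# A lower-bandwidth variant, using the settings mentioned above.
print("--sout=" + build_sout(DST, vb=400, scale=0.5))
```

Keeping the defaults identical to the command line above means you only spell out the settings you are experimenting with.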
Why transcode? Well, icecast only handles a limited format set. It’s really designed for audio data, not audiovisual. It’ll handle pretty well anything in an Ogg wrapper, though, and it is free. So if you want to stream video with Icecast, transcoding will probably be involved somewhere.
2) Streaming from a DVD (previously recorded event)
One would expect this to be as simple as
"c:\Program Files (x86)\VideoLAN\VLC\vlc.exe" dvdsimple:///E:/#1
but as it happens this seldom works, and the reason is the reaction time. Icecast is contacted with a header as soon as the streaming process begins. If it takes too long to get the DVD spun up and begin the process of streaming, icecast simply times out on you, leaving an error message along the lines of ‘ WARN source/get_next_buffer Disconnecting source due to socket timeout’.
Having tested this on various platforms, I find that the following command works very well in some cases:
vlc dvdsimple:///dev/dvd --sout='#transcode{vcodec=theo,vb=200,scale=0.4,theora-quality=10,fps=12,acodec=vorb,ab=48,channels=2}:std{access=shout,mux=ogg,}' --sout-transcode-audio-sync --sout-transcode-deinterlace
Apparently the DVD drive I first tested this with is just unusually slow. This DVD, being homegrown, doesn’t require libdvdcss to view/transcode.
3) Streaming with ffmpeg2theora
Bit of a Linux solution, this one. Install libvpx, libkate, scons and ffmpeg (all available as Slackbuilds for those who are that way inclined). Install ffmpeg2theora. Install libshout and oggfwd.
Then: try a command line along the lines of the following:
ffmpeg2theora /source/material/in/ffmpeg/readable/format.ext -F 16 -x 240 -c 1 -A 32 --speedlevel 2 -o /dev/stdout - | oggfwd server_port password /test2.ogg
Obviously the output of this is not exactly high-quality; it’s been resized to a width of 240 pixels, audio has been reduced in quality, framerate’s been reduced to 16. But all these configuration options can be played with. Here’s a useful help page:
Having called this a Linux solution, it’s worth pointing out that ffmpeg2theora is available for Windows and that oggfwd/ezstream have been used successfully on Windows as well. It’s also worth noting that, again, VLC can do the Ogg/Theora encoding too (and has done since 2006); it’s just a question of seeing what’s better optimised for your purpose on your platform.
Note also that in this instance no username is needed, and the password used in this case is that set in the ‘<source-password>’ directive in icecast.xml.
4) Streaming without icecast
Icecast is a useful default solution if you want to broadcast your event/recording to multiple people across the web. It’s also useful because, operating via HTTP, it doesn’t suffer from the sort of firewall/router problems that UDP-based video streaming, for example, typically encounters. On the other hand, if you’re streaming across a LAN (for example, into the next room), there’s (usually) no network border police to get in your way, and VLC does also offer a direct VLC-to-VLC HTTP-based streaming solution. Unlike Icecast, though, it’s not ideal for one-to-many broadcast.
The Videolan documentation has a graphical explanation of this setup:
5) Mixing video for streaming
An obvious application to test in this context is FreeJ. Sadly it’s a bit of a pain to compile as it doesn’t seem to have been touched for a while. You’ll need to use the following approach for configuring the code:
CXXFLAGS=-D__STDC_CONSTANT_MACROS ./configure --enable-python --enable-perl --enable-java --disable-qt-ui
Typing ‘make’ will result in the error: ‘snprintf’ was not declared in this scope. Add #include <stdio.h> to any files afflicted in this way.
You then come across a crop of errors resulting from changes in recent ffmpeg. Some of these can be resolved with a patch; for the rest, you’re better off going to the git repository rather than trying a stable version.
In principle you probably want to enable-qt-gui, but since it doesn’t currently compile I have left it as an exercise for some other day.
And once you have FreeJ working, you need to read the tutorial. Note this advice regarding addition of an audio track to FreeJ output.
In this case an interface and super implementation provided a default set of methods for dealing with the processing of individual web elements written to work in the majority of cases. It can be thought of as standard web behavior. Subclasses of this Strategy were provided to account for software differences.
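Archiver itself is written in Java, but the shape of the pattern is easy to show in a few lines of Python. This is a rough sketch only: the class names, method names and the wiki example are hypothetical and are not taken from the Archiver code base.

```python
class DefaultStrategy:
    """Standard web behaviour: works for the majority of sites."""

    def rewrite_link(self, link):
        # Drop a leading slash so the link resolves relative to the archive root.
        return link.lstrip("/")

    def skip(self, link):
        # By default nothing is excluded from the archive.
        return False


class WikiStrategy(DefaultStrategy):
    """Hypothetical subclass for wiki software that exposes edit pages."""

    def skip(self, link):
        # Edit forms make no sense in a static archive, so leave them out.
        return "action=edit" in link


def process(links, strategy):
    """Apply whichever strategy the user chose to every link found."""
    return [strategy.rewrite_link(link) for link in links if not strategy.skip(link)]


links = ["/index.html", "/page?action=edit"]
print(process(links, DefaultStrategy()))
print(process(links, WikiStrategy()))
```

The subclass only overrides the behaviour that differs, which is what keeps each software-specific strategy small.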
We currently support the following strategies:
One of the main tasks which Archiver performs is to make any links which appear in HTML, CSS or JavaScript files relative to the homepage of the website, so that they are no longer absolute links. The jsoup library for Java was especially useful in this case as it allows the detection of a specified tag in the HTML file. jsoup also uses a jQuery-style syntax to select the different elements from the HTML, e.g. “#” is used to select an ID and “.” is used to select a class. jsoup also tolerates invalid HTML, which is useful because mistakes in the markup don’t prevent a site from being fully archived. For the CSS and JavaScript, regular expressions were written to match the format of a CSS or JavaScript link; these could then be used to find and change the links. Alongside making links relative, Archiver also adds each link which it finds to the list of files to be added into the archive folder. Once the site has been archived recursively, a zip file is served up to the user.
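As an illustration of the regular-expression approach for CSS, here is a small Python sketch. It is not the expression Archiver uses (that code is Java); the pattern, the function name and the example URLs are my own assumptions, and it only handles url(...) references.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches url(...) references in CSS, with optional single or double quotes.
CSS_URL = re.compile(r"""url\(\s*(['"]?)([^'")]+)\1\s*\)""")

def relativise_css(css, homepage, css_path):
    """Rewrite absolute links under `homepage` relative to the CSS file's folder.

    Returns the rewritten CSS and the list of site paths that were found.
    """
    home = urlparse(homepage)
    base_dir = posixpath.dirname(css_path)
    found = []

    def replace(match):
        quote, link = match.group(1), match.group(2)
        target = urlparse(link)
        if target.netloc != home.netloc:
            return match.group(0)  # external or already relative: leave alone
        found.append(target.path)  # remember it so it can be fetched later
        relative = posixpath.relpath(target.path, base_dir or "/")
        return "url(%s%s%s)" % (quote, relative, quote)

    return CSS_URL.sub(replace, css), found

css = 'body { background: url("http://example.org/img/bg.png"); } a { cursor: url(http://other.org/c.cur); }'
rewritten, to_fetch = relativise_css(css, "http://example.org/", "/css/site.css")
print(rewritten)
print(to_fetch)
```

Returning the list of paths alongside the rewritten text mirrors the second job described above: every link found is also queued for inclusion in the archive.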
While existing solutions are available none of them provide the comprehensive rewriting capabilities of Archiver. All the user has to do is point the webapp at a site, choose a strategy and deploy the resulting zip.
Archiver also produces a README file which provides details of all the files which have been included in the archive and lists any errors such as missing pages.
Code is available from
While this is working code it has not received sufficient testing which is obviously vital for this type of project. With that in mind we would love to hear your feedback.
Unlike previous Storm hack days that have had a theme, this one was open ended for the developers to develop anything they wished. They have had good success in their previous hack days resulting in some of the hacks being turned into finished products and released on Apple’s App Store, such as Spyhunt and Shaken created by local software development company Riot.
At the hack day I teamed up with fellow Ruby developer and hardware hacker Paul Leader (who just happens to work at Storm). We had borrowed a receipt printer from Mike Ellis (organiser of Bath Digital Festival) with the intention of plumbing it up to the internet in order to print out tweets from the conference as a physical takeaway memento for festival goers.
Working from a highly complicated wiring diagram, we attempted to connect the printer to the internet. Unfortunately for us, after many hours in the morning trying to get this to work, we eventually gave up and had lunch. One of my fellow attendees sums this up quite nicely on her blog.
“I also spent a large part of the day sat next to Paul and Julian who were attempting to turn an old receipt printer into a tweet printer – sadly, they couldn’t get it to work, which was a shame – but it was interesting to see the processes and patience they both possessed to get to the desired result (or at least close to it).”
As is the way with most events, the wifi during the morning wasn’t quite up to par, so the other 60+ developers in the room found it hard to implement the ideas they wanted to build. After lunch the wifi was going strong and people started hacking again. I mainly spent the afternoon finding out what others were working on, and also worked on a Twitter text analysis tool with another attendee.
I think the day went really well. I spoke to some interesting people and thought the event was well organised.
The Managing Research Data hack day in Manchester was part of the JISC call by the same name being run by Simon Hodson. Although technically I am not part of any of the projects in the MRD call, I was still asked to attend. The hack day was actually a hack two days, with the room we were in open until the last person left.
After a morning of talks about various projects on the MRD call and various other data-related presentations, it was time to start/join a team and brainstorm some ideas. I joined forces with Nick Jackson and Harry Newton of Lincoln University and Nick Syrotiuk of Mimas. The idea for our project came from Joss Winn, who had got it from an academic at Lincoln. The basic idea was to create a system whereby an academic could see the outputs of all the research projects not just in their department, but across theirs and every other university.
To get started we first chose a project name from a random name generator, and then I created a GitHub project for it. The project would now and forever be known as Project Rainbow Beam. I created a simple Sinatra web app, built on top of MongoDB, to accept a JSON payload which would then be added to the Mongo database. We soon realised that the incoming JSON data needed to be sanitised, and I volunteered. As I was now chief of sanitisation, Nick J rewrote the front end using a PHP framework called CodeIgniter. To enable optimum developer communication we created a chatroom on Campfire. As we were using Campfire, it seemed a good idea to hook GitHub up to the chat room, so that every time we pushed code, Campfire would play a vuvuzela on all of our computers.
Skip to many hours later, Nick J and I were the last to go to bed having been up many hours hacking away at the project.
By mid to late morning day two, we had a fully Bootstrapped website, documentation, api endpoint, data sanitizer, and live feed which was updated via Pusher.
At the hack event, it was decided to vote on all the hack projects that had been going on to see which one would win a further two days’ development work, with the developers being whisked away to a hotel and given two days to make their project better. Unfortunately we didn’t win this, although our project was well received. The prize went to the BitTorrent group, whose idea was to use BitTorrent and SWORD to move large research data sets around.
These two events were very different, and were targeting very different audiences. However, the common thread they shared was that they were meant for developers. They both did well in catering for developer needs: coffee, wifi and electricity. It was great to be part of these two events. I learned a lot and met lots of great people. I look forward to the next hack day to find a new challenge to work on.
Over the past few years I have had my fair share of tricky data management opportunities. There was the financial transaction database that had no keys or indexes and had to be pieced back together by getting old source code releases, finding the bugs and reversing the incorrect values. There were the gigabytes of web log records that needed cross-referencing and, finally, the free-text marketing responses that needed analysing for patterns.
This was all a warm-up for my current opportunity. With this task I have all the issues at once. I have the scale, with 22 million odd items. I know it’s not enormous for 2012, but it is far from easily manageable. I have the lack of consistent relationships and, as the final piece of the puzzle, the lack of data quality.
What I wouldn’t give for a nice enumeration of types, something concrete to go on. Take dates for example. For decades it has been the norm to store dates in ISO format, or at least something that can be converted back and forth. If I am really lucky I get ISO dates. A lot of the time I get something that isn’t defined but is recognisable and can be converted to ISO, for example from ‘moon cycles since equinox format’ ™. Often, though, I get user-typed input, not from the same person and not even from the same system. Entering dates like “about the middle of last year” is guaranteed to anger even your most friendly neighbourhood developer.
Taken individually it isn’t that big a deal. However, producing results at reasonable web-response speeds for 22 million records, grouping by, counting and cross-referencing on thousands of possible groups on standard hardware, is eluding me. If you can make them dance this way I would love to hear from you.
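For the recognisable-but-undefined cases, the usual trick is to try a list of known formats in turn and give up gracefully on the rest. A minimal Python sketch follows; the list of formats is an assumption for illustration (including the guess that slash-separated dates are day-first), not a description of my actual data.

```python
from datetime import datetime

# Formats to try, in order; day-first is an assumption about the data's origin.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%Y%m%d"]

def to_iso(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    return None

for value in ["2012-03-04", "04/03/2012", "20120304", "about the middle of last year"]:
    print(repr(value), "->", to_iso(value))
```

Anything that comes back as None still needs a human, or at least a separate pile, which is rather the point of the complaint.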
I’m sure I am not the only one who has trouble keeping up with the latest and newest technology releases. There are lots of exciting new cool apps and services that I don’t really have the time to investigate due to the sheer quantity. Sometimes I just don’t have the patience to coax a demo app out of the latest beta release, constantly cross referencing against error messages. Finally, my least favourite, there is the kind of technology forced along by big business marketing.
Data storage is huge business, particularly on the other side of the Atlantic. There are vast sums of money at stake and even corporate survival can depend on the success or failure of given products. I’m no Commie, I don’t mind this in principle, but the amount of positioning, media attention and misinformation that then surrounds the products makes it very hard to separate the hype and the substance.
A couple of weeks ago I was deliberating, remembering all the great things I have heard about NoSQL. Maybe the NoSQL people have a point, and now I have a solid use case where my RDBMS is not suiting me. Up until this point I had discounted alternatives to my RDBMS on the grounds that any storage solution was moving bits around on a disc and that the same rules applied. Like all performance computing it is a game of caching. Keep indexes in memory and look to disc as little as possible. Indexes and disc space usage are always going to be more or less equal, leaving any performance improvements to the implementation, hardware and possibly some new algorithms. RDBMS design was based on set theory and predicate logic, or to put it another way, Maths. Very little has changed to allay my scepticism with regard to the speed and scale increases promised by the NoSQL movement. Even the idea that there is seen to be a movement worries me. I mean, it’s hardly suffrage, anti-war or civil rights, is it?
Up until now I have been talking about NoSQL as a single entity. Of course this is just one of the misleading factors. For some reason, lots of substantially different technologies have been lumped under one umbrella. Maybe the daunting numbers necessitated this; maybe it was because it was felt they could survive better as a combined opponent to RDBMS. The majority of them share some common themes but thinking of them as a single entity is particularly unhelpful. In fact several of the so-called NoSQL solutions have more in common with your SQL RDBMS than with each other. Two of these notable exceptions are CouchDB and Neo4j, which offer ACID compliance.
From Wikipedia, the generally accepted types of NoSQL solution are:
Document store, graph, Key-value store, multivalue, object, RDF, tabular and tuples
Having read a few articles about the various NoSQL solutions, it seems that each author has decided to group them and then talk about the groups, offering for each a few scenarios where it is useful. So far it has been easy to pick holes in every one of the lists, in some cases because they are out of date but mostly because even in these sub-types the feature sets can still be very different. For this reason I shall approach this from a slightly different angle. Firstly I shall talk about themes common to most (but not all) of the NoSQL solutions, then follow up with a few types of software and which specific NoSQL products would be useful.
Earlier I mentioned that I couldn’t really see how you can develop a significantly faster comparable version of a storage solution. In the case of the majority of the NoSQL products, the main selling point is horizontal scalability. To put it another way, it is easier to deploy them over lots of load-balanced clusters, and that is where the performance gains come from. RDBMSs do not scale as easily in this manner.
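The simplest form of horizontal scaling is to decide, from the key alone, which node holds a record. Here is a bare-bones Python sketch of hash-based sharding. The node names and keys are invented, and real products do something more sophisticated so that adding a node doesn’t reshuffle everything.

```python
import hashlib

def shard_for(key, shard_count):
    """Map a record key to a shard number in a stable, evenly spread way."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

SHARDS = ["db-node-0", "db-node-1", "db-node-2"]  # hypothetical node names

for key in ["user:1001", "user:1002", "user:1003"]:
    print(key, "->", SHARDS[shard_for(key, len(SHARDS))])
```

Because any client can compute the shard for itself, no node needs to know about the others, which is exactly what makes this cheap to spread over many boxes.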
The reason for this is that all good RDBMS’s are at least approaching ACID compliance. In essence, this is your guarantee that data you store is consistent and will be there when you want it. With ACID comes the concept of transactions which are important for many real world tasks, and without them bank transfers would vanish, nuclear missiles would launch. The locking required does not work as easily over RDBMS clusters due to the inevitable latency.
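To show what a transaction buys you, here is a small Python sketch using the sqlite3 module from the standard library. The table, names and amounts are made up. The transfer deliberately credits the receiver first, so that a failed debit would leave money created from nowhere if the two updates were not atomic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move money atomically: both updates happen, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint refused an overdraft

def balances(conn):
    """Return the current balances as a plain dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

The second transfer fails on the debit, and the rollback also undoes the credit that had already been applied.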
Having said that there are many cases where this isn’t important. You could maintain the consistency at the application level. It gets increasingly harder to maintain with increasing system complexity but it is far from impossible. Alternatively read only data sources are a good candidate or maybe you just don’t care. If the odd ‘Like’ or +1 goes missing the sun will still rise the next day. In addition I should probably point out that most people tend to agree that NoSQL means ‘Not Only SQL’. For reasons discussed, in most cases it would represent part of a given solution. A fast NoSQL solution would work well as a client facing readable resource to a large complex dataset.
A relaxed or in some cases entirely non-existent schema is another selling point. This for me is the key difference. So many times my model has altered slightly and various null checks have crept into my code. You can easily see how in some cases a very relaxed schema would be a nice thing to have.
Computing as a commodity has been a big driver behind many of these products. It isn’t hard to see the value of being able to easily spin up a few more database clusters over the Christmas busy period with little fuss. This is a key feature of how the horizontal scalability can be a massively appealing part of these solutions. Taking this further some products have an emphasis on distribution. For example you could have a country or regional presence in a datacentre where for example UK residents are served by one cluster/shard and Australian by another. Maybe you can offload your Black Friday North American rush to your Pacific Rim cluster where it is 2am.
If you thought there were a lot of NoSQL options then you are in for a treat when you start looking for a CMS. It seems that every developer has, at some point, started coding their own CMS. It isn’t hard to see that document stores are particularly suited to this task. Almost the entire focus is around the document. Taking a real world example, MongoDB and Etsy demonstrate a nice scenario for this use case. On Etsy you have various sellers all over the globe creating product pages. Some might have shipping restrictions, photos, size guides, linked products or any number of combinations. With MongoDB and a relaxed schema, a product page could be a single document with just the relevant categories embedded. I am willing to bet they don’t use it for their payment systems though.
Memcached is probably the most common and famous example of caching in the NoSQL world*. Famously, thousands of memcached nodes allow us all to keep up with the interesting happenings on Facebook. These nodes typically sit in front of a backing data store, provide hash-based caching of recently used items and run entirely from RAM. I think the key here is understanding that memcached can be used as part of a massive infrastructure rather than being something particularly revolutionary.
If you aren’t Facebook or similar and are thinking of adding one memcached box to the front of a box or two, you might be better off exploring other routes first.
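To picture what “a single document with just the relevant categories embedded” means, here is a sketch using plain Python dictionaries serialised as JSON. The products and field names are entirely made up and have nothing to do with Etsy’s real data model.

```python
import json

# Two hypothetical product pages: each document carries only the fields it needs.
products = [
    {
        "title": "Hand-knitted scarf",
        "seller": "woolly_things",
        "photos": ["scarf-front.jpg", "scarf-detail.jpg"],
        "size_guide": {"length_cm": 180, "width_cm": 25},
    },
    {
        "title": "Letterpress print",
        "seller": "inky_fingers",
        "shipping_restrictions": ["UK only"],
        "linked_products": ["frame-a4"],
    },
]

def fields_in_use(documents):
    """List every field name that appears in at least one document."""
    return sorted({field for doc in documents for field in doc})

for doc in products:
    print(json.dumps(doc, sort_keys=True))
print(fields_in_use(products))
```

In a relational design each of those optional fields would be a nullable column or another joined table; here a document that has no size guide simply doesn’t mention one.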
*Other k-v stores are available.
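The idea of a cache sitting in front of a backing store fits in a few lines. This Python sketch is an in-process toy with least-recently-used eviction, not a memcached client; the class name, the capacity and the stand-in lookup function are all invented for illustration.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny in-memory cache that evicts the least recently used key when full."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store  # any callable: key -> value
        self.items = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            return self.items[key]
        self.misses += 1
        value = self.backing_store(key)  # slow path: ask the real data store
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used entry
        return value

def slow_lookup(key):
    # Stand-in for a database query.
    return key.upper()

cache = LRUCache(capacity=2, backing_store=slow_lookup)
for key in ["a", "b", "a", "c", "a"]:
    print(key, "->", cache.get(key), "(misses so far: %d)" % cache.misses)
```

The backing store is only consulted on a miss, so the benefit depends entirely on how often the same keys are requested again.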
A relaxed and adaptable schema during software development has obvious benefits.
The most interesting type of solution in my opinion is the Graph database and oddly this seems to be the direction that receives the least attention. I have had a number of problems where I needed to view data from various angles at different times and the relational approach just didn’t work. I was constantly creating temporary tables of underlying data from different directions which became hard to maintain. Expressed as a graph I can see that it could be far easier to work with. Again the concept of data as a graph is hardly new but I am about to trial Neo4j as a solution to my current problem so I shall report back with my findings.
The likes of Hadoop MapReduce can be suited to analytics. Typically reporting makes it into the code at a much later stage and can be easily forgotten. I have seen many systems spending most of their cycles calculating the nightly sales reports with increasingly complicated SQL queries over their perfectly normalised data sets. Aggregating, result summarisation and general querying can be guaranteed with real time performance. Google, despite trying to replace it, are using a version of this behind the scenes to provide your search results. It clearly scales.
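For anyone who hasn’t met the pattern, this is roughly what a MapReduce job does, squeezed into a single Python process. It is a teaching sketch with invented sales figures, not Hadoop code: in a real cluster the map and reduce phases run in parallel on many machines and the framework does the grouping in between.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sales records: (region, amount in pence).
sales = [("UK", 1200), ("AU", 800), ("UK", 300), ("US", 1500), ("AU", 200)]

def map_phase(records):
    """Emit one (key, value) pair per record."""
    return [(region, amount) for region, amount in records]

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Fold each key's values into a single total."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map_phase(sales)))
print(totals)
```

Because each key is reduced independently, the work splits cleanly across machines, which is where the scaling comes from.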
It is a point worth labouring, that the key is in picking the right tool for your data. Slightly less obviously it is about how you need to reference that data, not only today but in the future.
Being a developer I had itchy keyboard fingers and didn’t quite get around to researching thoroughly before I trialled MongoDB. Seemingly it was a good match for my data with a relaxed schema, but there probably isn’t a worse match for the way I need to reference the data. Lesson learnt, until the next time. Had I not experimented, though, I would not have had the joy of expressing my MapReduce functions inside a Mongo query using JavaScript. Whose idea was that?
I am still evaluating Hadoop, the pinup for NoSQL. I think there is a lot of potential here for MapReduce in my batch operations, a clear fit, but there is considerable set-up overhead. The Hadoop umbrella has also become quite sizeable in its own right so I expect there is some more value in this area. Neo4j is also looking very promising. It is a graph-based ACID database and as such stands out. Relationships are treated, according to the documentation, as first class citizens so I am taking a look at this next. My only concern is how it performs with ad-hoc queries. Failing all this I will go back to multi-pass batch processing on my RDBMS with plenty of caching for good measure. It’s not elegant, but it works.