UKOLN Dev – Developer Labs

Archiving Old Dynamic Websites
Tue, 29 May 2012

Archiver is a web-based application written in Java using the Spring Framework. Its main use is to produce static versions of websites which are no longer updated with new content. This is important for several reasons. Dynamic sites typically target specific software dependencies, and as time passes those dependencies stop receiving patches, posing a security risk, or stop receiving updates, meaning the site may no longer function on the latest operating systems and code environments. This is a perennial problem for system administrators, who sometimes end up maintaining software that can be tens of years old on outdated systems. Archiver offers an easy means of creating a static version of these dynamic sites that can be deployed simply to a standard web server, typically, but not limited to, Apache.


Websites run on different software: some are written in plain HTML and CSS, while others run on CMS or wiki platforms such as WordPress, MediaWiki or Drupal. Each platform performs some tasks in a slightly different way, writing certain elements in HTML/CSS and laying out structure differently. After analysing the problem it was clear that we would need to target specific software in order to provide a high-quality solution. For this reason, the 'Strategy' design pattern was chosen.

In this case an interface and a base implementation provide a default set of methods for processing individual web elements, written to work in the majority of cases; it can be thought of as standard web behaviour. Subclasses of this strategy account for software-specific differences.
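
To make the pattern concrete, here is a minimal Java sketch of that arrangement; the class and method names are illustrative assumptions rather than Archiver's actual API.

import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical names; Archiver's real classes may differ.
interface ArchiveStrategy {
    /** Rewrite links in one page and return the URLs it references. */
    List<String> processPage(Document page, URI siteRoot);
}

/** Default behaviour: standard link handling that suits most sites. */
class BasicStrategy implements ArchiveStrategy {
    @Override
    public List<String> processPage(Document page, URI siteRoot) {
        List<String> found = new ArrayList<>();
        for (Element a : page.select("a[href]")) {
            found.add(a.attr("abs:href")); // collect for later archiving
        }
        return found;
    }
}

/** WordPress-specific subclass, overriding only what differs. */
class WordPressStrategy extends BasicStrategy {
    @Override
    public List<String> processPage(Document page, URI siteRoot) {
        List<String> links = super.processPage(page, siteRoot);
        // Platform-specific tweaks (e.g. wp-content asset paths) would go here.
        return links;
    }
}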

We currently support the following strategies:

  • Basic (static websites for migration)
  • Drupal
  • MediaWiki
  • WordPress

One of the main tasks Archiver performs is making any links that appear in HTML, CSS or JavaScript files relative to the homepage of the website, rather than absolute. The JSoup library for Java was especially useful here, as it allows the detection of a specified tag in an HTML file. JSoup uses a jQuery-style syntax to select elements from the HTML, e.g. "#" selects an ID and "." selects a class. JSoup also tolerates invalid HTML, which is useful because mistakes in the markup don't prevent a site from being fully archived. For the CSS and JavaScript, regular expressions matching the format of a CSS or JavaScript link were used to find and change the links. Alongside making links relative, Archiver also adds each link it finds to the list of files to be included in the archive folder. After archiving recursively, a zip file is served up to the user.
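
As a rough illustration of the JSoup side of this, the following sketch rewrites absolute links under the site root into relative ones; the method and variable names are mine, not Archiver's.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkRelativiser {
    static String relativise(String html, String siteRoot) {
        // Parsing with a base URI lets JSoup resolve relative links too.
        Document doc = Jsoup.parse(html, siteRoot);
        for (Element el : doc.select("a[href], link[href], script[src], img[src]")) {
            String attr = el.hasAttr("href") ? "href" : "src";
            String abs = el.attr("abs:" + attr); // absolute form of the link
            if (abs.startsWith(siteRoot)) {
                // Strip the site root, leaving a path relative to the homepage.
                el.attr(attr, abs.substring(siteRoot.length()));
            }
        }
        return doc.outerHtml();
    }
}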

While existing solutions are available, none of them provides the comprehensive rewriting capabilities of Archiver. All the user has to do is point the web app at a site, choose a strategy and deploy the resulting zip.

Archiver also produces a README file which details all the files included in the archive and lists any errors, such as missing pages.

Code is available from https://bitbucket.org/ukolnisc/archiver/src

While this is working code, it has not yet received the thorough testing that is obviously vital for this type of project. With that in mind, we would love to hear your feedback.

Some are more NoSQL than others
Fri, 23 Mar 2012

I’m no SQL Expert

Over the past few years I have had my fair share of tricky data management opportunities. There was the financial transaction database that had no keys or indexes and had to be pieced back together by getting old source code releases, finding the bugs and reversing the incorrect values. There were the gigabytes of web log records that needed cross-referencing and, finally, the free-text marketing responses that needed analysing for patterns.

This was all a warm-up for my current opportunity. With this task I have all the issues at once. I have the scale, with some 22 million items; I know that is not enormous for 2012, but it is far from easily manageable. I have the lack of consistent relationships and, the final piece of the puzzle, a lack of data quality.

What I wouldn't give for a nice enumeration of types, something concrete to go on. Take dates, for example. For decades it has been the norm to store dates in ISO format, or at least in something that can be converted back and forth. If I am really lucky I get ISO dates; a lot of the time I get something that isn't well defined but is recognisable and can be converted to ISO, for example from 'moon cycles since equinox format'™. Often, though, I get user-typed input, not from the same person and not even from the same system. Entering dates like "about the middle of last year" is guaranteed to anger even your most friendly neighbourhood developer.
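
A sketch of the usual brute-force answer, trying known formats in turn until one parses; this uses the modern java.time API and an invented pattern list, and the free-text cases still need a human.

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

class DateNormaliser {
    // Candidate patterns seen in the data; extend as new formats turn up.
    private static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.ISO_LOCAL_DATE,           // 2012-03-23
            DateTimeFormatter.ofPattern("dd/MM/yyyy"),  // 23/03/2012
            DateTimeFormatter.ofPattern("d MMMM yyyy")  // 23 March 2012
    );

    static Optional<LocalDate> toIso(String raw) {
        for (DateTimeFormatter f : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(raw.trim(), f));
            } catch (DateTimeParseException e) {
                // Not this format; try the next one.
            }
        }
        return Optional.empty(); // "about the middle of last year" lands here
    }
}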

Taken individually, none of this is a big deal. However, producing results at reasonable web-response speeds over 22 million records, grouping, counting and cross-referencing on thousands of possible groups on standard hardware, is eluding me. If you can make them dance this way, I would love to hear from you.

NoSQL Hype vs. Substance

I'm sure I am not the only one who has trouble keeping up with the latest technology releases. There are lots of exciting new apps and services that I don't really have the time to investigate, due to the sheer quantity. Sometimes I just don't have the patience to coax a demo app out of the latest beta release, constantly cross-referencing against error messages. Finally, my least favourite: the kind of technology forced along by big-business marketing.

Data storage is a huge business, particularly on the other side of the Atlantic. There are vast sums of money at stake, and even corporate survival can depend on the success or failure of a given product. I'm no Commie; I don't mind this in principle, but the amount of positioning, media attention and misinformation that surrounds these products makes it very hard to separate the hype from the substance.

A couple of weeks ago I was deliberating, remembering all the great things I have heard about NoSQL. Maybe the NoSQL people have a point, and now I have a solid use case where my RDBMS is not suiting me. Up until this point I had discounted alternatives to my RDBMS on the grounds that any storage solution is ultimately moving bits around on a disc and that the same rules apply. Like all performance computing, it is a game of caching: keep indexes in memory and look to disc as little as possible. Indexes and disc space usage are always going to be more or less equal, leaving any performance improvements to the implementation, the hardware and possibly some new algorithms. RDBMS design was based on set theory and predicate logic, or to put it another way, maths. Very little has changed to satisfy my scepticism about the speed and scale increases promised by the NoSQL movement. Even the idea that it is seen as a movement worries me. I mean, it's hardly suffrage, anti-war or civil rights, is it?

Some are more NoSQL than others

Up until now I have been talking about NoSQL as a single entity. Of course, this is just one of the misleading factors. For some reason, lots of substantially different technologies have been lumped under one umbrella. Maybe the daunting numbers necessitated this; maybe it was felt they could survive better as a combined opponent to the RDBMS. The majority of them share some common themes, but thinking of them as a single entity is particularly unhelpful. In fact, several of the so-called NoSQL solutions have more in common with your SQL RDBMS than with each other. Two notable exceptions are CouchDB and Neo4j, which offer ACID compliance.

From Wikipedia, the generally accepted types of NoSQL solution are:

document store, graph, key-value store, multivalue, object, RDF, tabular and tuple store.

Having read a few articles about the various NoSQL solutions, each author seems to have grouped them in some way and offered scenarios where each group is useful. So far it has been easy to pick holes in every one of these lists, in some cases because they are out of date, but mostly because even within these sub-types the feature sets can still be very different. For this reason I shall approach it from a slightly different angle: first covering themes common to most (but not all) of the NoSQL solutions, then following up with a few types of software and the specific NoSQL products that would be useful for each.

Speed and ACID

Earlier I mentioned that I couldn't really see how you could develop a significantly faster comparable version of a storage solution. For the majority of the NoSQL products, the main selling point is horizontal scalability. To put it another way, they are easier to deploy over lots of load-balanced clusters, and that is where the performance gains come from. RDBMSs do not scale as easily in this manner.

The reason for this is that all good RDBMSs at least approach ACID compliance. In essence, this is your guarantee that the data you store is consistent and will be there when you want it. With ACID comes the concept of transactions, which are important for many real-world tasks; without them bank transfers would vanish and nuclear missiles would launch. The locking required does not work as easily over RDBMS clusters due to the inevitable latency.

Having said that, there are many cases where this isn't important. You could maintain consistency at the application level; it gets increasingly harder to maintain as system complexity grows, but it is far from impossible. Alternatively, read-only data sources are a good candidate, or maybe you just don't care: if the odd 'Like' or +1 goes missing, the sun will still rise the next day. I should also point out that most people tend to agree that NoSQL means 'Not Only SQL'. For the reasons discussed, in most cases it would represent part of a given solution; a fast NoSQL store would work well as a client-facing readable resource in front of a large, complex dataset.

NoSchema

A relaxed, or in some cases entirely non-existent, schema is another selling point. This for me is the key difference. So many times my model has altered slightly and various null checks have crept into my code. You can easily see how, in some cases, a very relaxed schema would be a nice thing to have.

Commodity computing

Computing as a commodity has been a big driver behind many of these products. It isn't hard to see the value of being able to spin up a few more database nodes over the busy Christmas period with little fuss; this is how horizontal scalability becomes massively appealing. Taking this further, some products emphasise distribution. For example, you could have a country or regional presence in a datacentre, with UK residents served by one cluster/shard and Australians by another. Maybe you can offload your Black Friday North American rush to your Pacific Rim cluster, where it is 2 am.

UC1: Online Store/CMS/Blog

If you thought there were a lot of NoSQL options, you are in for a treat when you start looking for a CMS. It seems that every developer has, at some point, started coding their own. It isn't hard to see that document stores are particularly suited to this task: almost the entire focus is around the document. Taking a real-world example, MongoDB and Etsy demonstrate a nice scenario for this use case. On Etsy you have various sellers all over the globe creating product pages. Some might have shipping restrictions, photos, size guides, linked products or any number of combinations. With MongoDB and a relaxed schema, a product page could be a single document with just the relevant categories embedded, as the sketch below illustrates. I am willing to bet they don't use it for their payment systems, though.
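
A minimal sketch of what that looks like with the MongoDB Java driver; the field names and values are invented for illustration.

import java.util.Arrays;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

class ProductPages {
    // Two product documents with different shapes stored side by side;
    // no schema migration is needed for the differences.
    static void insertExamples(DBCollection products) {
        BasicDBObject scarf = new BasicDBObject("title", "Hand-knitted scarf")
                .append("shipsTo", Arrays.asList("UK", "IE"))
                .append("photos", Arrays.asList("front.jpg", "detail.jpg"));
        BasicDBObject print = new BasicDBObject("title", "Vintage print")
                .append("sizeGuide", "A3"); // no shipping info at all
        products.insert(scarf);
        products.insert(print);
    }
}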

UC2: Caching

Memcached is probably the most common and famous example of caching in the NoSQL world*. Notoriously, thousands of memcached nodes allow us all to keep up with the interesting happenings on Facebook. These are typically used in front of a backing data store, provide hash-based caching with least-recently-used eviction, and run entirely from RAM. I think the key here is understanding that memcached can be used as part of a massive infrastructure rather than being something particularly revolutionary.

If you aren't Facebook or similar and are thinking of adding one memcached box in front of a box or two, you might be better off exploring other routes first.

*Other k-v stores are available.
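
The underlying pattern is cache-aside, sketched below in Java; a ConcurrentHashMap stands in for the memcached cluster and loadFromDb for the backing store, purely for illustration.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loadFromDb;

    CacheAside(Function<K, V> loadFromDb) {
        this.loadFromDb = loadFromDb;
    }

    V get(K key) {
        // Hit: served from RAM. Miss: fall through to the database
        // and remember the answer for next time.
        return cache.computeIfAbsent(key, loadFromDb);
    }

    void invalidate(K key) {
        cache.remove(key); // call on writes so stale values don't linger
    }
}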

UC3: Development

A relaxed and adaptable schema during software development has obvious benefits.

UC4: Graph Data

The most interesting type of solution, in my opinion, is the graph database, and oddly this seems to be the direction that receives the least attention. I have had a number of problems where I needed to view data from various angles at different times and the relational approach just didn't work; I was constantly creating temporary tables of the underlying data from different directions, which became hard to maintain. Expressed as a graph, I can see that it could be far easier to work with. Again, the concept of data as a graph is hardly new, but I am about to trial Neo4j as a solution to my current problem, so I shall report back with my findings; a flavour of its embedded API is sketched below.
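
A minimal sketch against the Neo4j 1.x embedded API; the node properties and relationship type are invented for illustration.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

enum Rel implements RelationshipType { CITES }

class GraphSketch {
    static void link(GraphDatabaseService db) {
        Transaction tx = db.beginTx();
        try {
            Node first = db.createNode();
            first.setProperty("title", "Some record");
            Node second = db.createNode();
            second.setProperty("title", "Another record");
            // The relationship is stored directly; no join table required.
            first.createRelationshipTo(second, Rel.CITES);
            tx.success();
        } finally {
            tx.finish();
        }
    }
}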

UC5: Analytics

The likes of Hadoop MapReduce are well suited to analytics. Typically, reporting makes it into the code at a much later stage and can easily be forgotten; I have seen many systems spending most of their cycles calculating the nightly sales reports with increasingly complicated SQL queries over their perfectly normalised datasets. Aggregation, result summarisation and general querying can be handled at scale this way. Google, despite trying to replace it, uses a version of this behind the scenes to provide your search results. It clearly scales.
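
For instance, a nightly sales total per product maps naturally onto MapReduce. The sketch below assumes CSV input lines of "productId,amountInPence"; the input format and class names are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class SalesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Emit (productId, amount) for every order line.
        String[] parts = line.toString().split(",");
        ctx.write(new Text(parts[0]), new LongWritable(Long.parseLong(parts[1])));
    }
}

class SalesReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text product, Iterable<LongWritable> amounts, Context ctx)
            throws IOException, InterruptedException {
        // Sum all amounts for one product.
        long total = 0;
        for (LongWritable a : amounts) {
            total += a.get();
        }
        ctx.write(product, new LongWritable(total));
    }
}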

I’m no NoSQL Expert

It is a point worth labouring that the key is picking the right tool for your data. Slightly less obviously, it is also about how you need to reference that data, not only today but in the future.

Experiences

Being a developer, I had itchy keyboard fingers and didn't quite get around to researching thoroughly before I trialled MongoDB. Seemingly it was a good match for my data, with its relaxed schema, but there probably isn't a worse match for how I need to reference the data. Lesson learnt, until the next time. Had I not experimented, though, I would not have had the joy of expressing my MapReduce functions inside a Mongo query using JavaScript. Whose idea was that?

I am still evaluating Hadoop, the pin-up for NoSQL. I think there is a lot of potential for MapReduce in my batch operations, a clear fit, but there is considerable set-up overhead. The Hadoop umbrella has also become quite sizeable in its own right, so I expect there is more value in this area. Neo4j is also looking very promising. It is a graph-based ACID database and as such stands out. Relationships are treated, according to the documentation, as first-class citizens, so I am taking a look at this next. My only concern is how it performs with ad hoc queries. Failing all this, I will go back to multi-pass batch processing on my RDBMS with plenty of caching for good measure. It's not elegant, but it works.

CSS3 – Better Late Than Never
Mon, 28 Nov 2011

CSS 1/2

Back in the year 2000 we, and by that I mean anyone who wrote web pages, were bungling around with CSS. I was being told that tables were evil, that I needed to separate code (content) and style, and that CSS was the answer to my prayers. While the second point was clearly something to aim for, it became obvious that CSS at the time wasn't really capable of helping me achieve it. Don't get me wrong, it was a step in the right direction and rightfully consigned laying out sites in tables to the past. We were encouraged to lay elements out as divs, and most of the time it was possible to get at least most of the way to a true separation between style and code, for static sites at least.

Dynamic Sites

One problem I repeatedly encountered was rendering lists of elements of undetermined size into a tabular style. Maybe I wanted odd elements in a different colour, or my boss might come in and want the last element in cornflower blue to match his Tuesday tie. It was mostly possible, but it was a pain and tied the code into the visuals in several places.

CSS3 to the rescue

Finally, eleven years later, CSS3 comes to the rescue with some fancy new selectors (there were a few in CSS2). Here are some good ones I have used already.

This will add ‘>>’ at the end of any external links automatically. Repeat for https if required.

a[href^="http://"]:not([href*="www.domain.com"]):after { 
    content: " >>"; 
}

This will highlight any links that open in a new window.

a[target^="_blank"] { 
    color:#855; 
}


It also allows you to write markup that is almost (other than the class name) completely detached from the visuals. The following lays out divs in columns of three.

<div class="splitThird">
    <!-- loop in your language of choice, emitting one div per item -->
    <div>Content</div>
</div>

.splitThird { height: auto; clear: both; }

.splitThird div:nth-child(3n+1) {
    float: left; width: 33%; clear: left; margin-bottom: 20px;
}

.splitThird div:nth-child(3n+2) {
    float: left; width: 33%; clear: none;
}

.splitThird div:nth-child(3n+3) {
    float: right; width: 33%;
}
RepUK
Fri, 11 Nov 2011

About

The interest in exploiting the content found in institutional repositories is growing. At the same time, there is a range of possible uses for a central cache of the metadata records held by institutional repositories.

RepUK harvests the metadata of UK OpenDOAR-registered repositories via OAI-PMH and presents the results in the form of visualisations. At the time of writing, 152 repositories are harvested. The 1.3 million records and 26 million elements are processed on both a national and an individual basis. In addition, the metadata is presented as a faceted search based on Solr indexes.
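
The harvesting loop itself is simple at the protocol level: request ListRecords, store the records, and follow resumption tokens until none remain. A minimal sketch, with an illustrative class name and the XML parsing elided:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class OaiHarvester {
    static String fetchPage(String baseUrl, String resumptionToken) throws Exception {
        // A resumption-token request must not repeat metadataPrefix.
        String query = (resumptionToken == null)
                ? "?verb=ListRecords&metadataPrefix=oai_dc"
                : "?verb=ListRecords&resumptionToken=" + resumptionToken;
        StringBuilder xml = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(baseUrl + query).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                xml.append(line).append('\n');
            }
        }
        // Parse the XML, store each <record>, then repeat with the
        // response's <resumptionToken> until none is returned.
        return xml.toString();
    }
}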

See it in action | View the source


Technologies

RepUK is a Java application running in Apache Tomcat, on a Spring 2.5, Spring Batch, iBATIS, MySQL and Solr stack.

Screenshots


WebFont
Sun, 11 Sep 2011

Font support in browsers has always been a bit hit and miss. While support is improving, the only real way to get nice crisp fonts, particularly at the larger sizes, is to render your own anti-aliased images in your graphics editor of choice and embed them. This is very time-consuming. For this reason I created WebFont, which mimics many paid services and allows fonts to be included on a page by using a dynamic URL. It can be embedded into existing projects or run as a standalone service.


This snippet is used to fetch the '152' on the RepUK homepage.
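
As an illustration only, such a dynamic URL might look like the line below; every parameter name here is an assumption rather than WebFont's actual API.

<img src="http://your-host/webfont?text=152&font=SomeFont&size=48" alt="152"/>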


As this value updates dynamically, it is something that would be impossible to recreate by any other means, such as pre-rendering images by hand.


jQuery Plugin: Polling with timeouts
Sun, 10 Jul 2011

Polls a server resource at a specified timeout interval until abortTimeOut is reached, with options for testing responses for success.

This was useful for allowing a machine a certain amount of time to wake from hibernation before aborting.

 

 /** 
 * PollingWithTimeouts - Polls a server resource at specified timeout 
 * interval until abortTimeOut is reached
 *
 * Settings:
 * url -          the url to ajax call
 * method -       get/post (defaults to get)
 * sendData -     array of values to be passed in - 
 *                e.g. {name: "foo", something: "else"}
 * timeout -      timeout in ms between each poll (default 1000)
 * type -         json/text/xml what you are expecting as a 
 *                response (default json)
 * abortTimeOut - how long in total before we completely give up, in 
 *                milliseconds (default 60000); to have no abort 
 *                timeout, don't pass in abortCallBack or testCallBack
 *
 *
 * Usage:
 *
 * 1) As a never ending poll
 * $(document).ready(function(){
 *       $.PollingWithTimeouts({
 *              url : '/foo/',
 *              sendData:  {foo: 'foo', etc: 'etc'}
 *       });
 * })
 *
 * 2) As a never ending poll with a pollCallBack (stock ticker for example)
 * $(document).ready(function(){
 *       $.PollingWithTimeouts({
 *              url : '/foo/',
 *              sendData:  {foo: 'foo', etc: 'etc'}
 *       },
 *       function(pollCallBackData) { alert(pollCallBackData); }
 *   );
 * })
 *
 * 3) As a poll with an (optional) pollCallBack that aborts after the 
 * specified time calling back to the (optional) abortCallBack
 * 
 * $(document).ready(function(){
 *       $.PollingWithTimeouts({
 *              url : '/foo/',
 *              sendData:  {foo: 'foo', etc: 'etc'},
 *              abortTimeOut: 60000
 *       },
 *       function(pollCallBackData) { alert(pollCallBackData); },
 *       function(abortCallBackData) { alert(abortCallBackData); }
 *   );
 * })
 *
 * 4) As a poll with a callback that aborts after the specified time, 
 * calling back to the abort function. isCompletedCallBack allows the 
 * caller to test response data and return true when polling is complete.
 *    If isCompletedCallBack returns true it calls the successCallBack,
 *    e.g. (and the reason I wrote it) after doing WOL on a 
 * machine, poll the server to ping and determine if the machine is awake. 
 * This needs to abort after a period of time.
 * $(document).ready(function(){
 *     $.PollingWithTimeouts({
 *              url : '/foo/',
 *              sendData:  {foo: 'foo', etc: 'etc'},
 *              abortTimeOut: 60000
 *     },
 *     function(pollCallBackData) { alert(pollCallBackData); },
 *     function(abortCallBackData) { alert(abortCallBackData); },
 *     function(isCompletedCallBackData) {alert(isCompletedCallBackData);},
 *     function(successCallBackData) { alert(successCallBackData); }
 *   );
 * })
 *
 * Copyright (c) 2011 UKOLN (http://www.ukoln.ac.uk)
 * Mark Dewey
 * Licensed under GPL:
 * http://www.gnu.org/licenses/gpl.html
 *
 * Version: 0.9
 */

(function($) {
    $.PollingWithTimeouts = function(options, pollCallBack,
                abortCallBack, isCompletedCallBack, successCallBack) {

        // Declared locally: the originals leaked into the global scope.
        var settings = jQuery.extend({
            url: '',             // URL of ajax request
            method: 'get',       // get or post
            sendData: '',        // map of values to be passed in,
                                 // e.g. {name: "foo", something: "else"}
            timeout: 1000,       // milliseconds between polls - 1 sec
            type: 'json',        // or whatever else you like
            abortTimeOut: 60000  // 1 min
        }, options);

        // Pick the jQuery helper matching the requested method.
        var f = settings.method.toLowerCase() === 'post' ? $.post : $.get;

        var abort = new Date().getTime() + settings.abortTimeOut;
        var timer;

        getdata();

        function getdata() {
            f(settings.url, settings.sendData, function(d) {
                if (abortCallBack && new Date().getTime() > abort) {
                    // Out of time: stop polling and report the abort.
                    clearTimeout(timer);
                    abortCallBack(d);
                } else if (isCompletedCallBack && isCompletedCallBack(d)) {
                    // Caller has inspected the response and declared success.
                    clearTimeout(timer);
                    if (successCallBack) {
                        successCallBack(d);
                    }
                } else {
                    // Not finished: report this poll and schedule the next.
                    if (pollCallBack) {
                        pollCallBack(d);
                    }
                    timer = setTimeout(getdata, settings.timeout);
                }
            }, settings.type);
        }
    };
})(jQuery);
]]>
http://blogs.ukoln.ac.uk/ukolndev/2011/07/10/jquery-plugin-polling-with-timeouts/feed/ 0
WOL
Fri, 11 Mar 2011

This is an early test project for enabling wake-on-LAN. It is very integration-based, working with CAS, the Bath Person Finder and the network.

Badger
Thu, 11 Nov 2010

The Badger application creates word clouds. While there are many generators already in the wild, it adds an important feature: batch generation from spreadsheets. This saves a lot of effort when creating hundreds of Dev8D name badges.

