Archive for the ‘bandwidth’ Category

pmGraph – Bandwidth Monitoring for Networks

Saturday, February 20th, 2010
pmGraph video screencap

Video introducing pmGraph hosted by Vimeo

pmGraph is a free tool we produce to help administrators monitor bandwidth on networks.

Read more about it or watch the video above.

Many thanks to Mark for putting the video together.

Lopad: high speed collaborative text editing over low bandwidths

Saturday, January 30th, 2010

Video of lopad team at #crisiscampLDN

Note: lopad is still a work in progress. The #crisiscampLDN team have made good progress, and there will be a first cut up in the next few days. Getting it to the optimised, low bandwidth tool it needs to be is a project for the next few weeks.

I’m in “the ball room” (the other one’s called “the tree room” in case you’re wondering where the heck we are) at #crisiscampLDN and I’m thoroughly impressed by the efforts of @nickstreet and @mrchrisadams from Headshift.

What are they doing? Something really cool – making the first release of lopad. Check out the intro on blip.tv from @leashless above.

What is lopad?

Aptivate makes a service called loband – it takes high-bandwidth webpages and makes them work quickly on slow connections.

There’s a demand (e.g. from staff at UNFAO and users in Rwanda) for a related service, tuned for low bandwidth environments, that lets users easily collaborate over a text file.

This is the first addition to the lo* (pronounced lo-star) family of products: lopad.

Think of it as a super-lightweight Google Docs with no registration, instant visibility of other users’ changes and fast performance on all internet connections.

There’s already a product called Etherpad which has been open-sourced since Google bought the company (to absorb the team into Google Wave). They plan to discontinue the public service in a few months’ time. In response to this, there are already some other public instances of Etherpad (e.g. PiratePad) as well as many private ones, but the immediate goal of the lopad project is to:

  • Create lopad.org – a free public instance of etherpad, promoted for use in the relief and development sectors but open to anyone
  • Optimise lopad.org to perform well on low bandwidth and/or high latency connections

Some of the things we’re going to be doing along the way will be getting a production server up and running, getting a copy of etherpad on there, rebranding it to “lopad”, analysing the opportunities for improving the bandwidth performance of the system and then implementing them.

In true meta-style, there’s more info and developer notes online at: http://etherpad.com/crisiscampLDN

Some highlights from the above:

I also think this will be a valuable tool for use by people in the Crisis Camp effort.

So: friends, coders, countrymen, come and join in!

Simulating low bandwidths: how to make sure your apps work in the field

Saturday, January 23rd, 2010

I’m going to write about four ways to simulate a slow internet connection and a bit of background about why you’d want to do it. Simulation is great but I’ll say this now: there’s no substitute for testing stuff in the field. However, before you release to the team on the ground or grab your bag and hop on a plane, read this.

Why simulate low bandwidths?

Aptivate builds online software for people in the international development sector. Our users are in places where internet connections are slow and unreliable – we need to make sure our stuff works for them.

More people are accessing the web using mobile handsets or mobile internet connections (3G dongles and tethered phones). The bandwidth, latency and stability characteristics of these links are very different to the “always on broadband” that most developers target.

Finally, we’re involved with CrisisCamp London over the next few weeks – part of an international effort by dedicated volunteers to provide remote technical support and build tools for individuals and organisations working in Haiti. What they build needs to work in that environment – hopefully this post is going to be useful.

So, if you think your technology is going to be used in the scenarios above, read on. If you’re just awesome and like to know about this stuff, read on too…

Four ways to simulate a slow connection

  1. Use a profiler, your brain and some common sense (easiest)
  2. Use Aptivate’s online low bandwidth simulator (easy)
  3. Use Sloppy, a desktop Java app that simulates slow web links (pretty easy)
  4. Get a machine (with maybe two network interfaces) and do some IP traffic shaping (best results, not easy)

There are probably a few more but I’m on a plane from Rome to London to go to the first CrisisCamp and can’t think properly!

1. A profiler + simulate it in your brain.

This is particularly useful if you’re developing for the web.

YSlow Running on Reliefweb.int

There are a few online and in-browser tools that give you a breakdown of the resources your website is using.

Three important things to get from these tools are:

  1. What’s the bandwidth usage of each of my pages and typical interactions?
  2. How many individual requests are required for each page?
  3. How much content is cached?

The overall bandwidth is a feasibility test. We say aim for 25k per page, but use your own judgement – how fast is your user’s connection, how long will it take for them to get to something useful (hint – if it’s longer than 5-10 seconds: #FAIL)
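To turn the "5-10 seconds" rule into a page budget, here is a rough sketch in Python. The function name and the example link speeds are illustrative assumptions rather than measurements, and the sum ignores latency and protocol overhead, so treat the result as an upper limit.

```python
def max_page_kilobytes(bandwidth_kbit_s: float, target_seconds: float) -> float:
    """Largest page (in kilobytes) that can download within target_seconds.

    Assumes the link runs at its full nominal speed and ignores latency,
    so the real budget will be smaller than this.
    """
    # kbit/s * seconds = kilobits; divide by 8 to get kilobytes.
    return bandwidth_kbit_s * target_seconds / 8


# Hypothetical users with a 10 second patience limit.
for speed in (20, 50, 1000):
    budget = max_page_kilobytes(speed, 10)
    print(f"{speed:>5} kbit/s -> at most {budget:.1f} kB per page")
```

On a 20 kbit/s link the 10 second limit works out at 25 kB, which is where a target like "25k per page" starts to make sense.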

The number of requests also gives you an indication about performance over high latency or intermittent connections – in short, use fewer objects and cache them when you can.

Finally, if you’ve got a network usage meter (I have a noddy one running that comes with iStatMenus on the Mac) you can get a rough idea of how much bandwidth an app is consuming (should work fine even if you’re developing an app in a mobile simulator). I’ve seen tools for Windows whose names I can’t remember; on Linux you could use BWM or get fancy with logging modes in IPTables – Google for more.

That’s it.

2. Use Aptivate’s online Low Bandwidth Simulator

Aptivate’s Low Bandwidth Simulator

This technique is only useful if your site is accessible from a public URL. It only simulates bandwidth, not latency or packet loss.

We make Loband, an online service that strips the junk out of webpages and gives you a compressed, simplified version that works better on slow links.

As part of the Loband code, there’s a simulator which you can access here.

You plug in the URL of your site, select the bandwidth you want to simulate and hit go. I haven’t tested it recently with any serious AJAX/HTML5/Flex/Flash stuff so your mileage may vary if you make heavy use of these tools.

Do what a user would do with your app and see if it’s usable.

That’s it.

3. Use the Sloppy desktop Java app

Sloppy Java desktop bandwidth simulator

This technique is great if your site is running on a local dev box or even if it’s online. It only simulates bandwidth, not latency or packet loss.

Make sure Java lives on your machine. Download Sloppy. Run it, start it, point it at your app.

Do what you would have done with 2.

That’s it.

4. Get a machine (maybe two network cards) and do IP traffic shaping.

This technique is the best of the bunch: you can simulate bandwidth, latency and packet loss and do so for anything running on your machine or LAN. That’s anything: browser apps, mail clients, Skype, mobile simulators etc. It’s not hard but is a little fiddly. There are two broad ways you could do this: for yourself on a single machine or, for a bunch of people on a LAN.

Terminal Showing iperf measuring different bandwidths throttled by dummynet

iperf showing a dummynet throttled link

Quickly, to do it for yourself on your own machine for app testing: if you’re running FreeBSD / MacOSX, follow Bjørn Hansen’s tutorial.

It gets a bit trickier if you want to do it for several machines at once.

What we’re trying to do is turn a machine with two network interfaces (NICs) into a “router”. Traffic goes in/out of the first interface at normal speeds, but the traffic goes in/out of the second interface at user-selected levels of crapness (bandwidth, latency, packet loss)

Relatively speaking: this is easy on a Mac / BSD box, trickier on Linux and hard on Windows. While most laptops actually have 2 network interfaces (wifi + ethernet) – I normally do this with a desktop that’s got 2 NICs  or a laptop + a USB / CardBus/PCMCIA NIC.

On a Mac/BSD you’re going to be using ipfw to control the dummynet traffic shaper. Man up to find out more. In short: ipfw’s a firewall that classifies packets (e.g. by which port or IP they’re going to) into “flows”. Dummynet takes a flow and sticks it in a “pipe”. A pipe emulates a link with given bandwidth, propagation delay, queue size and packet loss rate.

….how on earth do we get this working?

There are better tutorials than I can write quickly here and  here. But in brief:

  1. Get a BSD machine with dummynet (enabled by default on OSX 10.4+, might need a kernel rebuild for FreeBSD) running with 2 NICs. Fire up a terminal, type in ifconfig and make sure you can see the two interfaces (en0 and en2 for me)
  2. Make sure you can route packets between interfaces.
  3. Make a pipe for the traffic between interfaces
  4. Configure your pipe, stick your traffic in there and smoke it.
  5. Tweak the pipe and simulate to your heart’s content.

In reality, this always takes me half an hour to get right – I’ve never had this go smoothly first time.

First things I check if it’s not working:

  • Is OSX / BSD doing some daft routing / automatic internet connection sharing that’s messing with your ipfw settings?
  • Are you routing using the right interfaces? I’ve actually got 7 network interfaces that show up in ifconfig to choose from (firewire, bt, vm, wifi, ethernet etc.)
  • bit/s and Byte/s are quite different…
  • Don’t despair, it will work, there’s pictures of me doing it here. :-)
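On the bit/s versus Byte/s point: the factor of eight is easy to lose track of when you compare a throttle setting with what a download dialog reports. A tiny Python sketch (the function name is just for illustration):

```python
def kbit_to_kbyte(kbit_per_s: float) -> float:
    """Convert kilobits per second to kilobytes per second (8 bits per byte)."""
    return kbit_per_s / 8


# A link throttled to 50 kbit/s moves at most 6.25 kilobytes each second.
print(kbit_to_kbyte(50))
```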

Typical bandwidth / latency / loss scenarios

The key commands you’ll be running to set parameters will look like:

ipfw pipe 1 config bw 50Kbit
ipfw pipe 1 config delay 200ms
ipfw pipe 1 config plr 0.2

The three variables you have to play with are bw (bandwidth), plr (random packet loss rate) and delay (latency). Here’s a super-rough guesstimate for some typical scenarios, please advise if I’m way out or there are other common scenarios:

Scenario              bw (Kbit)   delay (ms)   plr (ratio)
2.5G mobile (GPRS)    50          200          0.2
3G mobile             1000        200          0.2
VSAT                  5000        500          0.2
Busy LAN on VSAT      300         500          0.4
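If you switch between these scenarios a lot, it can help to generate the pipe settings rather than retype them. The sketch below is Python that only prints command strings; it does not run ipfw. The `Scenario` and `pipe_command` names are made up for illustration, and combining bw, delay and plr into a single `config` call is an assumption on my part, so check `man ipfw` on your system before relying on it.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    bw_kbit: int    # bandwidth in Kbit
    delay_ms: int   # delay in milliseconds
    plr: float      # packet loss rate as a ratio between 0 and 1


# Values copied from the rough table above.
SCENARIOS = {
    "2.5G mobile (GPRS)": Scenario(50, 200, 0.2),
    "3G mobile": Scenario(1000, 200, 0.2),
    "VSAT": Scenario(5000, 500, 0.2),
    "Busy LAN on VSAT": Scenario(300, 500, 0.4),
}


def pipe_command(pipe: int, s: Scenario) -> str:
    """Build one ipfw command string for the given pipe number (assumed form)."""
    return f"ipfw pipe {pipe} config bw {s.bw_kbit}Kbit delay {s.delay_ms}ms plr {s.plr}"


for name, scenario in SCENARIOS.items():
    print(f"# {name}")
    print(pipe_command(1, scenario))
```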

What about windows and linux?!

I promise to update this bit with more info when I’ve got Linux, Windows boxes and Chris to hand.

In short though: with linux it’s the same idea, machine with 2 NICS, get them routing, use IPTables and the linux traffic shaper, tc. It’s not as good as dummynet (no packet loss IIRC) but gets the job done. For Windows, I’d honestly have to do some more research, last time I tried it, I just pulled out my Macbook.

Please add any tips and corrections below!

Tariq

When it comes to websites… small IS beautiful

Thursday, July 9th, 2009

There are two reasons why you should make your websites as small as possible. By small I mean minimising the size of data your user must download to see your web pages.

The first reason is usability. Time and again it has been shown that users like speedy websites. Google and Amazon have recently found even a delay of half a second can mean a 20% drop in users. Obviously your site must provide what your audience is looking for, and it must make it easy to find, but the number one factor that contributes to a positive user experience is speed. Ideally you want your pages to load within 1 second. They must load within 10 seconds; research shows consistently that visitors will leave a site if it doesn’t load in 10 seconds or less, and the fewer seconds it takes to load, the more engaged a visitor will be. Even with the ever increasing connection speeds of broadband we are seeing in the UK, if you’re not careful, it’s still perfectly possible to make sites that are too slow.

The second reason – the reason that most interests me and Aptivate, the organisation I work for – is global accessibility. Like us, you may feel we have a moral duty to ensure important information is accessible in the developing world or you may see the developing world as an interesting emerging market. Either way, if you want your content to be accessible in the developing world you need to seriously consider the size of your web pages. Aptivate has been focussing on this issue from the perspective of users in less developed countries. We’ve found that the majority of information is inaccessible, even information that is intended to be used by this audience. The fact is that the developing world is years behind the broadband revolution we are witnessing in the “global North”.

bandwidth vs page size

Not only that, but as more bandwidth becomes available in developing countries it is matched by increasing demand. We foresee that bandwidth will remain much lower in developing countries than in wealthy ones for some time to come. This must be considered when designing for a global audience.

Over the past 5 years the size of the average web page has increased by 300%. Meanwhile, in developing country universities, we estimate the bandwidth available to an individual user will have increased by 20 – 60% – and this is from a very low starting point. Bandwidth is increasing slowly for developing country universities whilst bandwidth demands from their users and from websites, document downloads and on-line applications are increasing rapidly.

It CAN look good

When I talk to people about low-bandwidth friendly websites the first concern is that they would be somehow sub-standard. We must dispel the myth that low-bandwidth websites are boring and ugly. This is simply untrue.

Let’s make an analogy with building a house. If you wanted to build an energy efficient house would it have to be ugly? No. You may need to spend a bit more effort designing it in the beginning. The construction costs are nearly the same and there is no reason, other than the lack of imagination of your architect, that your house cannot be beautiful. And so it is with websites. The requirements to be small, fast, usable and globally accessible are just additional parameters for your designers. These additional requirements will be of negligible additional cost and yet will transform the user experience of all your users. Your designers are likely to produce a website that looks clean, clear and concise – all qualities that users have been found to prefer. If your main market is in the global North your users will benefit from a fast response which is the main contribution to their satisfaction in using your site. If your audience is in the developing world, designing for low-bandwidth will make the difference between them being able to see your website and not.

Small, fast, responsive web pages are good practice and are globally accessible. This is a win-win situation. The big players like Google and Amazon understand this. Others have not yet got the message.

Developing country universities

In 2008 Aptivate estimated that the bandwidth available to individual university students and researchers in low income and developing countries (for example, in most of Africa, parts of Latin America and South Asia) is 20 kb/s – which is about 1/100th the speed of a broadband connection to a typical UK home. While bandwidth will have increased since then it is still going to be about a factor of 100 slower than the average domestic UK connection which is now over 3000 kb/s (3 Mb/s).

Recently I did a survey of 27 publishers’ websites. This was not an in-depth study just a quick temperature check but I think the results are still interesting. I chose the 27 publishers from the sponsors of a major conference. I “googled” each publisher then measured the size of the first page I got to from the Google search results, usually the publishers’ home page. The average page size was 250 kB which is not far off the current global average page size. However the largest was 800 kB while the smallest was 20 kB.

What does this mean for users in developing country universities? The average web page from this sample would take over a minute and a half to load. The table below shows the various page load times with times over 10 seconds high-lighted.

Page load times in seconds (* = over 10 seconds)

Page size           Developing University   Dial-Up      UK Broadband
                    (20 kb/s)               (56 kb/s)    (3000 kb/s)
smallest (20 kB)    8                       3            0.1
average (250 kB)    100*                    36*          0.7
largest (800 kB)    320*                    114*         2.1
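The figures in the table are plain division: page size in kilobits over link speed in kilobits per second. Here is a short Python sketch that reproduces them, assuming 1 kB = 8 kilobits and that the whole nominal bandwidth goes to one page. The dictionary and function names are illustrative.

```python
SPEEDS_KBIT_S = {"Developing University": 20, "Dial-Up": 56, "UK Broadband": 3000}
PAGES_KB = {"smallest": 20, "average": 250, "largest": 800}


def load_seconds(page_kb: float, speed_kbit_s: float) -> float:
    """Minimum download time in seconds, ignoring latency and overhead."""
    return page_kb * 8 / speed_kbit_s


for label, kb in PAGES_KB.items():
    cells = []
    for speed in SPEEDS_KBIT_S.values():
        seconds = load_seconds(kb, speed)
        flag = "*" if seconds > 10 else " "  # mark anything over 10 seconds
        cells.append(f"{seconds:8.1f}{flag}")
    print(f"{label:<9}({kb:>3} kB)" + "".join(cells))
```

The printed values carry one decimal place, so they differ slightly from the rounded numbers in the table.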

These figures should be read as minimum download times. There are other factors besides bandwidth that affect download times like the complexity of the website. I find it’s pretty rare even in the UK to see pages loading in less than a second.

PDFs

If you’re a publisher it’s likely that you publish your articles as PDF files. In which case you may be asking yourself what’s the relevance of all this talk about web page optimisation. Firstly it should be noted that a lot of what I’ve said about web pages is true of PDF files as well. It is possible through bad formatting options to make PDFs unnecessarily large. PDF files can be optimised for printing which will make them higher quality but much larger. Alternatively they can be formatted for screen reading in which case they are a lot smaller. If you’re using a computer to read PDF optimised for screen reading you wouldn’t normally notice the difference… except in the amount of time you would have to wait for it. Giving the user a choice between these formats can help those with slow connections.

A year ago we did a small survey of PDF files from scientific journals. We found that most of the time these were well optimised. They were still large but this is because they contain a lot of information – graphs, charts, equations etc. When working with African university researchers we found that the large size of PDF files was not the biggest problem. The articles themselves represent high value content. Even if they take several hours to download (which, in some cases they did) this could still be tolerated by the user. They found ways of adapting to this for instance by doing other work while the article downloads or, in the rare cases where the power is left on, downloading the files overnight.

The real problem was the path the user had to follow to get access to the PDF article. While the PDF files represent valuable content for the user, the many web pages the user must navigate to gain access to the PDF usually represent little value. It’s important that this path is as direct as possible. We must be careful not to let too much branding or gadgetry thwart the user in their goal. While an African researcher may be prepared to start a PDF download that will take a long time they should not be expected to navigate through a dozen pages each of which may take several minutes to load. It is this kind of frustrating experience that will drive users from your site.

The causes

What makes web pages so big? Isn’t it the features that our users demand? Most of the time I don’t think it is. It’s just wastage and bloat.

When I get introduced to a new organisation I often have a look at their website and measure how big it is. If I have some spare time I like to see how hard it would be to halve the size of their home-page. This usually takes between 10 minutes and half an hour with little discernible difference to the user.

The most frequent culprits and the easiest to fix are the images. In many cases it would be better to change the design to rely less on large images. Even without changing the design large savings can be made simply by optimising the format of the images.

Next it’s worth trying to optimise the code. The HTML and CSS files that make up websites can be full of “comments”, white-space, unused sections and other unnecessary bits and pieces. It’s often straightforward to remove the wastage.
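As a taste of how mechanical this can be, here is a deliberately naive Python sketch that strips comments and surplus white-space from a stylesheet. It is an illustration only (the function name is made up) and it would mangle CSS that has comment-like text or significant spacing inside quoted strings; a real optimiser handles those cases.

```python
import re


def strip_css(css: str) -> str:
    """Naive CSS shrinker: drops /* ... */ comments and collapses white-space."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # remove comments
    css = re.sub(r"\s+", " ", css)                         # collapse runs of white-space
    css = re.sub(r"\s*([{};,])\s*", r"\1", css)            # trim around braces, ; and ,
    return css.strip()


sample = """
/* main heading */
h1 {
    color: #333;
    margin: 0 auto;
}
"""

small = strip_css(sample)
print(small)
print(len(sample), "->", len(small), "characters")
```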

Another area of bloat is the JavaScript – chunks of code that are part of many websites and run on the user’s machine. Optimising the JavaScript can be easy or can be hard.

Sometimes the JavaScript just isn’t needed. For instance when it’s used for styling tricks which can now be done in more efficient ways.

Sometimes there’s lots of it that just isn’t used. There are JavaScript “library” files that contain many functions. A site may include a large library file but may only call one or two functions in it.

Sometimes the JavaScript comes as part of the Content Management System (CMS) the site uses. In this case it can be a bit trickier to sort out but still possible.

Things to do

If you’re interested in making your site faster and more globally accessible here are some ideas that might interest you.

The first step is to find out how big your pages are. Tools like PingDom will measure the size of your pages. Tools like Google’s PageSpeed and Yahoo’s YSlow will even make suggestions of what you can fix.

We have written on-line guidelines for designing websites for global accessibility. We discuss the reasons why designing for low-bandwidth is a good idea and give concrete guidance on how to do that. We also list tools like YSlow and various automatic optimisers. You can see our Top Ten design guidelines here:

http://www.aptivate.org/webguidelines/TopTen.html

On the 11th of September (2009) we will be speaking at the ALPSP conference in Oxfordshire. We are also going to be running short “Halve Your Home Page” workshops – a hands-on session where we show you how to shrink the size of your own site (email info@aptivate.org for details).

Low Bandwidth Web: Opera Turbo

Tuesday, June 9th, 2009

Aptivate (then Aidworld) was founded in 2003 by a group of techies and aidworkers wrestling with the question: how can you make the web usable for relief workers in the field?

Opera Turbo in Action

The problem then was access to bandwidth and the cost of that access.

Typical satellite phone connection speeds were 9.6Kbps (think of cold treacle flowing uphill or the state of dial-up in the early 90s) and the cost would be anywhere from $2 to $20 per minute.

5 minutes to download something like cnn.com made it unusable and $100 for the privilege made it unaffordable.
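The arithmetic behind those numbers, as a rough Python sketch. It assumes the link runs at the full 9.6 Kbps for the whole call, which is optimistic, and the function names are illustrative.

```python
def kilobytes_transferred(speed_kbit_s: float, minutes: float) -> float:
    """Data moved at a steady line rate, in kilobytes (8 bits per byte)."""
    return speed_kbit_s * minutes * 60 / 8


def call_cost(minutes: float, dollars_per_minute: float) -> float:
    """Cost of keeping the connection open for the given time."""
    return minutes * dollars_per_minute


size_kb = kilobytes_transferred(9.6, 5)
print(f"5 minutes at 9.6 Kbps is roughly {size_kb:.0f} kB")
print(f"Cost: ${call_cost(5, 2):.0f} to ${call_cost(5, 20):.0f}")
```

So the 5 minute download corresponds to a page of around 360 kB, and the $100 figure is the top of the per-minute price range.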

We came up with loband – a free online service that simplifies web pages. It downloads them remotely, trims them down and returns them to the user in a lightweight format. It can offer a 5-10x reduction in bandwidth used.

Fast forward 6 years and it’s interesting how similar the story is.

Some of us now have fast desktop & mobile web connectivity, but websites have gotten heavier (the first page I get to on Facebook is 1.25MB…) and we don’t always have access to our quick connections.

The fundamental issue is still there: the web can be slow and expensive if you’re not on a fast “unlimited data” connection.

Opera have been doing great things with their mobile browser for some time. They recently introduced the Opera Turbo feature into their desktop edition. The concept is similar to loband but it’s designed to integrate transparently into the browser.

Opera route all relevant traffic via their servers and return a compressed stream of data to the browser containing the content you want. From the picture above, you can see that they compress graphics to save bandwidth.

One thing I suspect they do (although I haven’t checked) is reduce the overall number of requests between the browser and the server. Going back to Facebook – it takes 92 HTTP requests to build my home page. That becomes painful if you’re on a low bandwidth, high latency connection. You effectively incur an overhead for each of those 92 requests.
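To put a number on that per-request overhead, here is a rough model in Python. It assumes each request costs one round trip and that the browser runs a fixed number of requests at a time. The 1000 ms round trip and the 6 parallel requests are illustrative guesses, not measurements, and the model ignores DNS lookups, connection setup and the transfer time itself.

```python
def latency_overhead_seconds(requests: int, round_trip_ms: float, parallel: int = 1) -> float:
    """Time spent purely waiting on round trips, ignoring transfer time."""
    batches = -(-requests // parallel)  # ceiling division
    return batches * round_trip_ms / 1000


# 92 requests over a hypothetical link with a 1 second round trip.
print(latency_overhead_seconds(92, 1000))              # one request at a time
print(latency_overhead_seconds(92, 1000, parallel=6))  # six at a time
```

Even with six requests in flight, the waiting alone comes to 16 seconds in this model, which is why cutting the request count matters as much as cutting the bytes.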

If Opera can turn that into fewer, overall smaller requests – the Norwegians rock even more than I think they already do.

Web Optimisation: Google Page Speed

Friday, June 5th, 2009

Bandwidth management and optimisation (BMO) is one of Aptivate’s strategic areas.

Optimising websites is a key ingredient.

Google Page Speed in Firefox

Google just introduced Page Speed which looks like it could be a useful tool. Much like Yahoo’s YSlow it sits on top of Firebug and tells you when you’ve done something naughty.

Aptivate publish web design guidelines targeted at authors developing content for audiences on low bandwidth connections. Google also publish a set of best practices for web performance.

This is interesting – we were motivated to write the guidelines because users in developing countries have a hard time accessing poorly optimised content. Google are motivated to write similar guidelines because they recognise that speed should be a commercial concern.

I think there’s a broadly shared goal that ultimately means good things for those on slower connections.

A quick aside on BMO:

Chris and Martin worked with KENET – The Kenya Education Network Trust, running bandwidth management workshops for network administrators in Kenya, in June 2009.

We’ve posted an overview of the workshop and resources we’re using on our OER Wiki.

Backup Mail Exchangers

Wednesday, January 28th, 2009

On Monday night, the power supply unit (PSU) in the server that hosts our mail server failed at around 2200 GMT. We don’t have physical access to the server out of hours, so I wasn’t able to replace it until about 1045 the next day, so our main email server was down for nearly 13 hours.

We didn’t have a backup MX because:

  • It usually can’t check whether recipients are valid or not, and therefore must accept mail that it can’t deliver;
  • It usually doesn’t have as good antispam checks as the primary, because it’s a hassle to keep it updated;
  • Spammers usually abuse backup MXes to send more spam, including Joe Jobs.

I thought that this was OK because people who send us mail also have mail servers with queues, which should hold the mail until our server comes back up. It’s normal for mail servers to go down sometimes and this should not cause mail to be lost or returned.

However, we had a report that one of our users did not receive a mail addressed to them, and was told by the sender that it had bounced. I saw the bounce message and suspected Exchange, so I decided to check how long Exchange holds messages before bouncing them. Turns out it’s only five hours by default. Most mail servers hold mail for far longer, for example five days, sending a warning message back to the sender after one day.

Bouncing messages looks bad on us. Apart from making our main mail server more reliable :) we need a backup MX to accept mail when the master is down.

However I do still want to minimise the spam problem that this will cause. Therefore I configured our backup MX to only accept mail when the master is down. Otherwise it defers it, which will tell the sender to try sending it to the master (again).

How did I achieve this magic? With a little Exim configuration that took me a day and that I’m quite proud of. I set up a new virtual machine which just has Exim on it, nothing else. I configured it as an Internet host, and to relay for our most important domains. Then I created /etc/exim4/exim4.conf.localmacros with the following contents:

CHECK_RCPT_LOCAL_ACL_FILE=/etc/exim4/exim4.acl.conf
callout_positive_expire = 5m

This allows us to create a file called /etc/exim4/exim4.acl.conf which contains additional ACL (access control list) conditions. The other change, callout_positive_expire, I’ll describe in a minute.

I created /etc/exim4/exim4.acl.conf with the following contents:

# if we know that the primary MX rejects this address, we should too
deny
        ! verify = recipient/callout=30s,defer_ok
        message = Rejected by primary MX

# detect whether the callout is failing, without causing it to
# defer the message. only a warn verb can do this.
warn
        set acl_m_callout_deferred = true
        verify = recipient/callout=30s
        set acl_m_callout_deferred = false

# if the callout did not fail, and the primary mail server is not
# refusing  mail for this address, then it's accepting it, so tell
# our client to try again later
defer
        ! condition = $acl_m_callout_deferred
        message = The primary MX is working, please use it

# callout is failing, main server must be failing,
# accept everything
accept
        message = Accepting mail on behalf of primary MX

The first clause, which has a deny verb, does a callout to the recipient. A callout is an Exim feature which makes a test SMTP connection and starts the process of sending a mail, checking that the recipient would be accepted. This is designed to catch and block emails that the main server would reject. Our backup server has no idea what addresses are valid in our domains; only the primary knows that.

The callout response is cached for the default two hours if it returns a negative result (the recipient does not exist on the master) or five minutes (see callout_positive_expire above) if the address does exist. We use a defer_ok condition here so that if we fail to contact the master, we don’t defer the mail immediately, but instead assume that the address is OK and therefore continue to the next clause.

The second clause of the ACL, which has a warn verb, is what took me so long to work out. Normally, if a condition in a statement returns a result of defer, which means that it failed, the server will defer the whole message (tell the sender to come back later). In almost all cases this is the right thing to do, but it’s the exact opposite of what we want here. We want to accept mail if the callout is failing, not defer it, otherwise our backup MX is useless (it stops accepting mail if the primary goes down).

Because this is such an unusual thing to do, there is no configurable option for it in Exim. The only workaround that I found is that there is exactly one way to avoid a deferring condition causing the message to be deferred: a warn verb. The documentation for the warn verb says:

If any condition on a warn statement cannot be completed (that is, there is some sort of defer), the log line specified by log_message is not written… After a defer, no further conditions or modifiers in the warn statement are processed. The incident is logged, and the ACL continues to be processed, from the next statement onwards.

So what we do is:

  1. Set the local variable acl_m_callout_deferred to true;
  2. Try the callout. If it defers (cannot contact the primary server) then we stop processing the rest of the conditions in the warn statement, as described above;
  3. If we get to this point, we know that the callout did not defer, so we set acl_m_callout_deferred to false.

The third clause of the ACL, which has a defer verb, simply checks the variable that we set above. If we get this far then the primary server is not rejecting this address; and if it’s not deferring either, then it must be accepting mail for the address. In that case, we defer the message, telling our SMTP client to try again later, at which point it will hopefully succeed in delivering directly to the primary.

Callout result caching becomes a problem here. If the master was not reachable, but a previous callout had verified that a particular address existed, and that callout result was cached for the default 24 hours, then the backup MX would defer subsequent mail to that address for the next 24 hours, even if the master went down. This is why we changed the positive callout result caching time to 5 minutes earlier.

The fourth clause of the ACL, which has an accept verb, is even simpler. It accepts everything that was not denied or deferred earlier. We can only get this far if the master is not accepting or rejecting mail for that address.
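The four clauses boil down to a small decision table. The Python below is only a model of that logic for checking your understanding, not Exim configuration. The `Callout` enum and `backup_mx_decision` function are invented names, and the model ignores callout caching entirely.

```python
from enum import Enum


class Callout(Enum):
    ACCEPT = "accept"  # primary would accept the recipient
    REJECT = "reject"  # primary rejects the recipient
    DEFER = "defer"    # primary could not be contacted


def backup_mx_decision(callout: Callout) -> str:
    """Return what the backup MX does with a message: deny, defer or accept."""
    # Clause 1 (deny): reject whatever the primary rejects. Because of
    # defer_ok, an unreachable primary does not stop processing here.
    if callout is Callout.REJECT:
        return "deny"

    # Clause 2 (warn): assume the callout deferred, then clear the flag
    # only if the callout actually completed.
    callout_deferred = True
    if callout is not Callout.DEFER:
        callout_deferred = False

    # Clause 3 (defer): the primary is up and accepting, so send the
    # client back to it.
    if not callout_deferred:
        return "defer"

    # Clause 4 (accept): the primary is unreachable, so take the mail.
    return "accept"


for result in Callout:
    print(f"primary says {result.value:<6} -> backup will {backup_mx_decision(result)}")
```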

So far the configuration appears to work fine and has blocked 14 spam attempts (abusing the backup MX) in 14 hours.

Offline Wikipedia

Friday, November 21st, 2008

I’m working on making Wikipedia, the (in)famous free encyclopaedia, available offline, for a project in a school in rural Zambia where Internet access will be slow, expensive and unreliable.

What I’m looking for is:

  • Completely offline operation
  • Runs on Linux
  • Reasonable selection of content from English Wikipedia, preferably with some images
  • Looks and feels like the Wikipedia website (e.g. accessed through a browser)
  • Keyword search like the Wikipedia website

Tools that have built-in search engines usually require that you download a pages and articles dump file from Wikipedia (about 3 GB download) and then generate a search index, which can take from half an hour to five days.

For an open source project that seems ideally suited to being used offline, and considering the amount of interest, there are surprisingly few options (already developed). They also took me a long time to find, so I’m collating the information here in the hope that it will help others. Here are my impressions of the solutions that I’ve tried so far, gathered from various sources including makeuseof.com.

The One True Wikipedia, for comparison

MediaWiki (the Wikipedia wiki software) can be downloaded and installed on a computer configured as an AMP server (Apache, MySQL, PHP). You can then import a Wikipedia database dump and use the wiki offline. This is quite a complex process, and importing takes a long time, about 4 hours for the articles themselves (on a 3 GHz P4). Apparently it takes days to build the search index (I’m testing this at the moment). This method does not include any images, as the image dump is apparently 75 GB and no longer appears to be available. It also displays some odd template codes in the text (shown in red below), which may confuse users.

MediaWiki local installation

Wikipedia Selection for Schools is a static website, created by Wikimedia and SOS Children’s Villages, with a hand-chosen and checked selection of articles and images from the main Wikipedia that fits on a DVD or in 3 GB of disk space. It’s available for free download using BitTorrent, which is rather slow. Although it looks like Wikipedia, it’s a static website, so while it’s easy to install, it has no search feature. It also has only 5,500 articles compared to the 2 million in Wikipedia itself (about 0.3%). Another review is on the Speed of Creativity Blog. Older versions are available here. (thanks BBC)

Wikipedia Selection for Schools

Zipedia is a Firefox plugin which loads and indexes a Wikipedia dump file. It requires a different dump file, containing the latest metadata (8 GB) instead of the usual one (3 GB). You can then access Wikipedia offline in your browser by going to a URL such as wikipedia://wiki. It does not support images, and the search feature only searches article titles, not their contents. You can pass the indexed data between users as a Zip file to save time and bandwidth, and you may be able to share this file between multiple users on a computer or a network. (thanks Ghacks.net)

WikiTaxi is a free Windows application which also loads and indexes Wikipedia dump files. It has its own user interface, which displays Wikipedia formatting properly (e.g. tables). It looks very nice, but it’s a shame that it doesn’t run on Linux.

WikiTaxi screenshot (wikitaxi.org)

Moulin Wiki is a project to develop open source offline distributions of Wikipedia content, based on the Kiwix browser. They claim that their 150 MB Arabic version contains an impressive 70,000 articles, and that their 1.5 GB French version contains the entire French Wikipedia, more than 700,000 articles. Unfortunately they have not yet released an English version.

Kiwix itself can be used to read a downloaded dump file, thereby giving access to the whole English Wikipedia via the 3 GB download. It runs on Linux only (as far as I know) and the user interface is a customised version of the Firefox browser. Unfortunately I could not get it to build on Ubuntu Hardy due to an incompatible change in Xulrunner. (Kiwix developers told me that a new version would be released before the end of November 2008, but I haven’t been able to test it yet.)

Kiwix (and probably MoulinWiki)

Wikipedia Dump Reader is a KDE application which browses Wikipedia dump files. It generates an index on the first run, which took 5 hours on a 3 GHz P4, and you can’t use it until it’s finished. It doesn’t require extracting or uncompressing the dump file, so it’s efficient on disk space, and you can copy or share the index between computers. The display is in plain text, so it looks nothing like Wikipedia, and it includes some odd system codes in the output which could confuse users.

Wikipedia Dump Reader

Thanassis Tsiodras has created a set of scripts to extract Wikipedia article titles from the compressed dump, index them, and parse and display articles with a search engine. It’s a clever hack, but the user interface is quite rough, it doesn’t always work, and it requires about twice the dump file size in additional data. It was a pain to figure out how to use it and get it working. It looks nothing like Wikipedia, although it does look better than the Dump Reader above.

Thanassis Tsiodras' Fast Wiki with Search

Pocket Wikipedia is designed for PDAs, but apparently runs on Linux and Windows as well. The interface looks a bit rough, and I haven’t tested the keyword search yet. It doesn’t say exactly how many articles it contains, but my guess is that it’s about 3% of Wikipedia. Unfortunately it’s closed source, and as it comes from Romania, I don’t trust it enough to run it. (thanks makeuseof.com)

Pocket Wikipedia on Linux (makeuseof.com)

Wikislice allows users to download part of Wikipedia and view it using the free Webaroo client. Unfortunately this client appears only to work on Windows. (thanks makeuseof.com)

WikiSlice (makeuseof.com)

Encyclopodia puts the open source project on an iPod, but I want to use it on Linux.

Encyclopodia

It appears that if you need search and Linux compatibility, then running a real Wikipedia (MediaWiki) server is probably the best option, despite the time taken.