
Archive for the ‘Appropriate Technology’ Category

The (ongoing) need for speed

Monday, June 21st, 2010
Jakob Nielsen


13 years ago Jakob Nielsen wrote an important article stating that one of the most significant factors in web usability is speed.

In the work that we do, designing web applications that are used in developing countries, we have taken this advice very much to heart.

13 years later, Jakob Nielsen has felt the need to write a new version of that article. And I am glad he has! Despite the roll-out of broadband, web authors are still creating sites that are slow, although for different reasons, according to Jakob.

In the original article Jakob said that large images were the main culprit in causing slow web pages. Now he says, with the advent of broadband, large images are not the main problem.

Interestingly, with the sites we look at and the connection speeds we deal with, large images still are one of the main contributing factors to slow sites.

Jakob now lays the blame on too many fancy widgets.

I would agree with Jakob here. In my experience the size of JavaScript is now rivalling that of large images for the sites we’re interested in.

The research into user interface response times is as true now as it was back in 1968 when it was done. From Nielsen’s article, remember these times:

  • 0.1 seconds gives the feeling of instantaneous response — that is, the outcome feels like it was caused by the user, not the computer.
  • 1 second keeps the user’s flow of thought seamless. Users can sense a delay, and thus know the computer is generating the outcome, but they still feel in control of the overall experience and that they’re moving freely rather than waiting on the computer.
  • 10 seconds keeps the user’s attention. From 1–10 seconds, users definitely feel at the mercy of the computer and wish it was faster, but they can handle it. After 10 seconds, they start thinking about other things, making it harder to get their brains back on track once the computer finally does respond.

A 10-second delay will often make users leave a site immediately.

Now consider the implications of these times in conjunction with your users’ connection speed, particularly if they happen to be in the developing world.
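
As a rough sketch of what those limits mean on a slow link, the Python below works out the largest page that could arrive inside each limit. The link speeds are illustrative assumptions rather than measurements, and the sum ignores latency and protocol overhead, so real budgets will be smaller.

```python
# Nielsen's three response-time limits, in seconds (from the list above).
LIMITS_S = {"instantaneous": 0.1, "flow of thought": 1.0, "attention": 10.0}

# Illustrative link speeds in Kbit/s. These are assumptions, not measurements.
LINKS_KBIT = {"GPRS-like": 50, "dial-up": 56, "broadband": 2000}


def max_page_bytes(link_kbit_per_s: float, limit_s: float) -> float:
    """Largest payload in bytes that fits through the link within limit_s.

    Ignores latency, protocol overhead and packet loss.
    """
    bits = link_kbit_per_s * 1000 * limit_s
    return bits / 8


if __name__ == "__main__":
    for name, kbit in LINKS_KBIT.items():
        for label, limit in LIMITS_S.items():
            kilobytes = max_page_bytes(kbit, limit) / 1000
            print(f"{name:>10} | {label:>16} | {kilobytes:8.1f} kB")
```

On the assumed 50 Kbit/s link, the whole 10-second attention budget buys about 62 kB, which is less than many single images.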

(see also our web design guidelines for low bandwidth connections.)

Alan’s Random Idea

Saturday, June 19th, 2010

You are probably familiar with Moore’s Law – that every year or so the power of computers doubles. I have a theory too, well… more a hypothesis…. ok let’s call it a “Random Idea”  – about progress in software.

Imagine for a moment you’re an engineer working on hard-drive design. Your goal is obvious – you want to cram more stuff in less area on a disk. You want it to work faster and, for mobile devices in particular, use less power. It might not be easy to achieve but you know what you’re trying to do.

Imagine you’re a chip designer. What are your goals? You want to make chips with more stuff in less area, that work faster and use less power.
Alans Random Idea Graph
So while Moore’s Law states that the hardware capability follows a geometric increase, Alan’s Random Idea says that software capability increases linearly.

Imagine you’re working on the Microsoft Excel team. What’s your goal? You can already embed the kitchen sink in a cell. What are you going to make it do next? Put the cells in a cube instead of a grid? (Hey.. that’s actually quite a mad idea… hmm… that’s another blog post).

The goals for software, for what we want to do with computers, require more imagination. So REAL software capability progresses linearly.

The area between Moore’s curve and mine I call “guff”. The not-so-essential stuff we waste all our computing power on.
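
To put toy numbers on the idea, here is a minimal sketch in Python. The doubling period and the linear gain per year are made-up assumptions chosen only to show the shape of the two curves; nothing here is measured.

```python
def hardware_capability(year: float, doubling_years: float = 1.0) -> float:
    """Geometric growth: capability doubles every doubling_years (assumed)."""
    return 2.0 ** (year / doubling_years)


def software_capability(year: float, gain_per_year: float = 1.0) -> float:
    """Linear growth: a fixed gain each year (assumed), starting from 1."""
    return 1.0 + gain_per_year * year


def guff_gap(year: float) -> float:
    """Vertical gap between the two curves in a given year."""
    return hardware_capability(year) - software_capability(year)


if __name__ == "__main__":
    for year in (0, 5, 10):
        print(year, hardware_capability(year), software_capability(year), guff_gap(year))
```

With these assumed rates the two curves start level, and after ten years almost all of the hardware capability is gap. Summing the gaps over the years gives the area between the curves.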

Let me give you an example of what I mean. Do you remember DOS? That thing we had before Windows? Even on an old DOS machine you can run WordPerfect, the word processor, and VisiCalc, the spreadsheet. I bet you could even run an email program on it. A lot of the business value you get from a computer in most small businesses is in doing email, writing documents and working out things in spreadsheets.

Guff, the Great Leap and the Bicycle

The Brompton folding bicycle is a classic in design. Beautifully engineered and fantastically functional. People love Bromptons. You could own one for 10 years and it wouldn’t even lose 10% of its value. They’re expensive too.
A Great Leap Bike
At the other end of the spectrum we have the “Cambridge Bicycle”. Cambridge must be the world capital of bicycle theft. The unwritten rule in Cambridge is you should spend more on your lock than on the bike. And most people don’t like to spend more than £20 on a lock. The bikes are atrocious. Newcomers to Cambridge at first cower away when they see a typical bike, worried that just being near the rusting, brake-less thing might predispose them to having a nasty accident.

However, from a distance… a long distance… (like out in space), the difference in capability between someone with a Brompton and a Cambridge Bike is a lot less than that between having a Cambridge bike and having none at all. However terrible a Cambridge bike may be, it’s already made the “Great Leap”.

The same is true of the DOS machine, which is the Cambridge Bicycle of the computing world. It’s also already made the Great Leap between having no computer and having something useful. Moore’s Law means that the cost of making the Great Leap becomes, at least in theory, incredibly cheap. The current crop of mobile phones have enough processing power to run a business. In a couple of years’ time they will be throw-away items.

There are many people in the world who are seriously constrained by resources, who are living on just a few dollars a day. Perhaps Moore’s Law and Alan’s Random Idea might mean they have a chance, if they can get by without the guff, of making a great leap.

This was the gist of a last minute lightning talk I gave at SPA2010.

pmGraph – Bandwidth Monitoring for Networks

Saturday, February 20th, 2010
pmGraph video screencap

Video introducing pmGraph hosted by Vimeo

pmGraph is a free tool we produce to help administrators monitor bandwidth on networks.

Read more about it or watch the video above.

Many thanks to Mark for putting the video together.

Simulating low bandwidths: how to make sure your apps work in the field

Saturday, January 23rd, 2010

I’m going to write about four ways to simulate a slow internet connection and a bit of background about why you’d want to do it. Simulation is great but I’ll say this now: there’s no substitute for testing stuff in the field. However, before you release to the team on the ground or grab your bag and hop on a plane, read this.

Why simulate low bandwidths?

Aptivate builds online software for people in the international development sector. Our users are in places where internet connections are slow and unreliable – we need to make sure our stuff works for them.

More people are accessing the web using mobile handsets or mobile internet connections (3G dongles and tethered phones). The bandwidth, latency and stability characteristics of these links are very different to the “always on broadband” that most developers target.

Finally, we’re involved with CrisisCamp London over the next few weeks – part of an international effort by dedicated volunteers to provide remote technical support and build tools for individuals and organisations working in Haiti. What they build needs to work in that environment – hopefully this post is going to be useful.

So, if you think your technology is going to be used in the scenarios above, read on. If you’re just awesome and like to know about this stuff, read on too…

Four ways to simulate a slow connection

  1. Use a profiler, your brain and some common sense (easiest)
  2. Use Aptivate’s online low bandwidth simulator (easy)
  3. Use Sloppy, a desktop Java app that simulates slow web links (pretty easy)
  4. Get a machine (with maybe two network interfaces) and do some IP traffic shaping (best results, not easy)

There are probably a few more but I’m on a plane from Rome to London to go to the first CrisisCamp and can’t think properly!

1. A profiler + simulate it in your brain.

This is particularly useful if you’re developing for the web.

YSlow Running on Reliefweb.int

There are a few online and in-browser tools that give you a breakdown of the resources your website is using.

Three important things to get from these tools are:

  1. What’s the bandwidth usage of each of my pages and typical interactions?
  2. How many individual requests are required for each page?
  3. How much content is cached?

The overall bandwidth is a feasibility test. We say aim for 25k per page, but use your own judgement – how fast is your user’s connection, how long will it take for them to get to something useful (hint – if it’s longer than 5-10 seconds: #FAIL)

The number of requests also gives you an indication about performance over high latency or intermittent connections – in short, use fewer objects and cache them when you can.
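
A back-of-envelope way to combine the two numbers is sketched below in Python. It is a simplification under stated assumptions: “25k” is read as 25,000 bytes, every batch of requests costs one full round trip, and connection setup, DNS and packet loss are ignored. The function name and parameters are made up for illustration.

```python
import math


def estimated_load_seconds(page_bytes: int, requests: int,
                           bandwidth_kbit: float, rtt_ms: float,
                           parallel: int = 1) -> float:
    """Rough page load time: transfer time plus round-trip waiting time.

    Assumes each batch of `parallel` requests pays one round trip and
    ignores connection setup, DNS lookups and packet loss.
    """
    transfer_s = page_bytes * 8 / (bandwidth_kbit * 1000)
    round_trips = math.ceil(requests / parallel)
    waiting_s = round_trips * rtt_ms / 1000
    return transfer_s + waiting_s


if __name__ == "__main__":
    # Same 25,000-byte page on an assumed 50 Kbit/s, 200 ms link,
    # split into 5 objects and then into 50 objects.
    print(estimated_load_seconds(25_000, 5, 50, 200))
    print(estimated_load_seconds(25_000, 50, 50, 200))
```

Under these assumptions the same 25,000 bytes take about 5 seconds as 5 objects and about 14 seconds as 50 objects, which is why the request count matters as much as the page weight.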

Finally, if you’ve got a network usage meter (I have a noddy one running that comes with iStatMenus on the Mac) you can get a rough idea of how much bandwidth an app is consuming (should work fine even if you’re developing an app in a mobile simulator). I’ve seen tools for Windows whose names I can’t remember; on Linux you could use BWM or get fancy with logging modes in iptables – Google for more.

That’s it.

2. Use Aptivate’s online Low Bandwidth Simulator

Aptivates Low Bandwidth Simulator

This technique is only useful if your site is accessible from a public URL. It only simulates bandwidth, not latency or packet loss.

We make Loband, an online service that strips the junk out of webpages and gives you a compressed, simplified version that works better on slow links.

As part of the Loband code, there’s a simulator which you can access here.

You plug in the URL of your site, select the bandwidth you want to simulate and hit go. I haven’t tested it recently with any serious AJAX/HTML5/Flex/Flash stuff so your mileage may vary if you make heavy use of these tools.

Do what a user would do with your app and see if it’s usable.

That’s it.

3. Use the Sloppy desktop Java app

Sloppy Java desktop bandwidth simulator

This technique is great if your site is running on a local dev box or even if it’s online. It only simulates bandwidth, not latency or packet loss.

Make sure Java lives on your machine. Download Sloppy. Run it, start it, point it at your app.

Do what you would have done with 2.

That’s it.

4. Get a machine, (maybe two network cards) do IP traffic shaping.

This technique is the best of the bunch: you can simulate bandwidth, latency and packet loss and do so for anything running on your machine or LAN. That’s anything: browser apps, mail clients, Skype, mobile simulators etc. It’s not hard but is a little fiddly. There are two broad ways you could do this: for yourself on a single machine, or for a bunch of people on a LAN.

Terminal Showing iperf measuring different bandwidths throttled by dummynet

iperf showing a dummynet throttled link

Quickly, to do it for yourself on your own machine for app testing: if you’re running FreeBSD / Mac OS X, follow Bjørn Hansen’s tutorial.

It gets a bit trickier if you want to do it for several machines at once.

What we’re trying to do is turn a machine with two network interfaces (NICs) into a “router”. Traffic goes in/out of the first interface at normal speeds, but the traffic goes in/out of the second interface at user-selected levels of crapness (bandwidth, latency, packet loss)

Relatively speaking: this is easy on a Mac / BSD box, trickier on Linux and hard on Windows. While most laptops actually have 2 network interfaces (wifi + ethernet) – I normally do this with a desktop that’s got 2 NICs  or a laptop + a USB / CardBus/PCMCIA NIC.

On a Mac/BSD you’re going to be using ipfw to control the dummynet traffic shaper. Man up (as in man ipfw) to find out more. In short: ipfw’s a firewall that classifies packets (e.g. by which port or IP they’re going to) into “flows”. Dummynet takes a flow and sticks it in a “pipe”. A pipe emulates a link with given bandwidth, propagation delay, queue size and packet loss rate.

….how on earth do we get this working?

There are better tutorials than I can write quickly here and  here. But in brief:

  1. Get a BSD machine with dummynet running and 2 NICs (dummynet is enabled by default on OS X 10.4+; FreeBSD might need a kernel rebuild). Fire up a terminal, type in ifconfig and make sure you can see the two interfaces (en0 and en2 for me).
  2. Make sure you can route packets between interfaces.
  3. Make a pipe for the traffic between interfaces
  4. Configure your pipe, stick your traffic in there and smoke it.
  5. Tweak the pipe and simulate to your heart’s content.

In reality, this always takes me half an hour to get right – I’ve never had this go smoothly first time.

First things I check if it’s not working:

  • Is OSX / BSD doing some daft routing / automatic internet connection sharing that’s messing with your ipfw settings?
  • Are you routing using the right interfaces? I’ve actually got 7 network interfaces that show up in ifconfig to choose from (firewire, bt, vm, wifi, ethernet etc.)
  • bit/s and Byte/s are quite different…
  • Don’t despair, it will work, there’s pictures of me doing it here. :-)

Typical bandwidth / latency / loss scenarios

The key commands you’ll be running to set parameters will look like:

ipfw pipe 1 config bw 50Kbit
ipfw pipe 1 config delay 200ms
ipfw pipe 1 config plr 0.2

The three variables you have to play with are bw (bandwidth), plr (random packet loss rate) and delay (latency). Here’s a super-rough guesstimate for some typical scenarios; please advise if I’m way out or there are other common scenarios:

Scenario              bw (Kbit/s)   delay (ms)   plr (ratio)
2.5G mobile (GPRS)    50            200          0.2
3G mobile             1000          200          0.2
VSAT                  5000          500          0.2
Busy LAN on VSAT      300           500          0.4

What about Windows and Linux?!

I promise to update this bit with more info when I’ve got Linux, Windows boxes and Chris to hand.

In short though: with Linux it’s the same idea: a machine with 2 NICs, get them routing, use iptables and the Linux traffic shaper, tc. Plain tc shaping is not as good as dummynet (no packet loss IIRC), though tc’s netem queueing discipline can add delay and packet loss, and it gets the job done. For Windows, I’d honestly have to do some more research; last time I tried it, I just pulled out my MacBook.

Please add any tips and corrections below!

Tariq

Technology decisions in organisations great and small

Thursday, January 21st, 2010

Ken Banks often writes about Social Mobile’s Long Tail – it’s a really helpful concept; one that I find myself frequently using when explaining our work to others.


Social Mobile's Long Tail by Ken Banks

Whenever I see Ken’s picture, I’m reminded of the similar relationship between complexity and organisational size and I’m proud of how Aptivate works successfully across this spectrum. We try to bring the breadth of our knowledge, skills and experience to bear when working with everyone from communities in rural Zambia, to NGOs in the UK, international agencies in Europe and governments across Africa.

I think this is really important.

It’s great to be “the policy people” or “the community technology people” but you need people who span these worlds and can join the dots.

That’s us.

We’ve spent months in rural Zambia working with young women getting low-power computers, GPRS connections and mobile systems working to support local entrepreneurship. Now we have greater confidence offering advice on mobile monitoring and evaluation strategies for NGOs in the region, and in turn, to guide an international agency wanting to know what kind of policy monitoring is possible, and how data might integrate into their wider systems.

I had an enjoyable conversation yesterday with the folks at CAFOD who want to know if mobiles could strengthen their work at the local partner and international levels. I met them through BarCampAfricaUK last November and finally had a chance to catch up.

Personally, I’m really interested in working with medium-sized organisations trying to make better use of technology. I probably have similar conversations 2-3 times a month.

I think there are some common characteristics and challenges for these organisations:

  • They already use some technology in the areas you’d expect: fundraising, communications, advocacy, admin and finance, and monitoring and evaluation.
  • They don’t have much capacity to explore and understand how new technologies (e.g. mobiles, collaboration tools and media capture) or advances in current technologies (e.g. open standards, APIs, social media) can help their programmes.
  • Local partners are already ahead of the game when it comes to the use of mobiles. This is typically out of necessity – even basic SMS is an astoundingly versatile medium.
  • The “technology champions” in an organisation, the individuals who appreciate the possibilities, are not always the decision makers. They often don’t have the time to investigate these opportunities and present information around which decisions can be taken.
  • Experimenting with the various tools out there can be challenging for the non-geek and it’s hard to find out about the realities of implementation.
  • Consultants are expensive and companies who sell “off the shelf products” might not have the best interest of the organisation at heart.
  • There are some great resources out there that catalogue technologies, there are also some good case studies that cover certain scenarios but there are few resources that specifically help people make decisions at the organisational level.

So here’s a promise: we’ll help you make decisions about technology. We’ll do a whole lot more, but at its simplest, we’ll do what it takes for you to decide what to do.

The first three things on my list of “how to support decisions” after my conversations yesterday are:

  • Write a blog post on technology decision making for medium-sized organisations, reassuring them that they’re in good company. (done)
  • Write a primer on “why use mobiles for data gathering and communication” with a goal to support decisions.
  • Put together a “mobile gadget lab in a briefcase” to take to organisations so they can play with pre-configured versions of various tools on various devices supporting a couple of different workflows.

Do any of these thoughts resonate with you?

Comments most welcome!

Tariq

Large Wireless Networks

Tuesday, January 5th, 2010

I saw an interesting request on the AfNOG mailing list:

How does one determine the number of users,  a wireless network can support. I need to buy a wireless router to support 2000 users within an organization. The problem is how do I determine this capability given the specs of the wireless router.

To put it in a better way “what determines the number of users a wireless router can support”[?]

Although I’m not an expert on wireless networks, I have worked with them a bit, and I sent a reply that might be useful to others (I hope).

I’m not sure there’s an easy answer to that question. Some factors that may influence the decision are:

  • The total bandwidth available to a single wireless access point (AP), e.g. 54 Mbps for an 802.11g router. This also depends on the level of 802.11 that the clients support. An 802.11b client will use much more airtime per packet than an 802.11g client, so if most of your clients are 802.11b then you won’t get more than 11 Mbps per AP, regardless of the theoretical maximum of the AP.
  • The frequency space available. There are only three non-overlapping 802.11b channels (maybe fewer for 802.11g), so no matter how many APs you have, the most bandwidth you could get in a given spot cannot be more than three times the bandwidth of one AP. Also, if they form a contiguous roaming network (same SSID and key) you have little or no control over which one a client will associate with, so you can’t evenly divide the available bandwidth between the three that you can see.
  • The guard time between different transmissions and for RTS/CTS round trips. This will cut your available bandwidth at least in half from the theoretical maximum, and more if you have hidden nodes (which is close to inevitable with thousands of clients, unless they are all in the same room).
  • The maximum number of clients that can associate with a given router. Most APs don’t publish this number, but Cradlepoint routers can handle between 4 and 64 clients per router. Keenan Systems reckons that “Once you have more than 25 clients associated most access points start to break down”. I’d guess that Cisco kit has the highest limit, especially the professional versions (not Linksys branded) and el cheapo generic Chinese kit has the lowest.
  • If the AP is serving DHCP and running NAT (acting as a router as well as an AP) then the translation and DHCP tables of the router will be a limit. Some router DHCP servers only allow class C subnets, with a maximum of 253 usable client IP addresses per AP. It’s probably more advisable to use a real machine (with a hard disk) as a DHCP server.
  • Similarly, if you don’t do NAT on the AP, then whatever handles the NAT on your Internet gateway will see the IPs of the individual machines, and will therefore need to be able to handle however many simultaneous IPs your clients have, and connections that they make.
  • Whatever your DHCP server, the number of IPs available in your network subnet will limit the number of clients who can have a valid unique IP address at one time.
  • The bandwidth of your Internet connection. The minimum that I’ve seen working at all is 3 kbps per client, or 6 Mbps with 2000 clients. That should be real bandwidth, not contended upstream by the ISP, otherwise multiply by the contention ratio. Don’t forget to include your fixed clients as well.
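
Some of these factors reduce to simple arithmetic, sketched below in Python. The function names are made up for illustration, the 25-clients-per-AP default is the Keenan Systems figure quoted above, and the AP count ignores the limit of three non-overlapping channels, so treat the results as lower bounds rather than a design.

```python
import math


def uplink_kbps(clients: int, per_client_kbps: float = 3.0,
                contention_ratio: float = 1.0) -> float:
    """Internet bandwidth to buy: clients x per-client minimum x contention."""
    return clients * per_client_kbps * contention_ratio


def access_points_needed(clients: int, clients_per_ap: int = 25) -> int:
    """Lower bound on APs if each handles at most clients_per_ap associations."""
    return math.ceil(clients / clients_per_ap)


def smallest_subnet_prefix(clients: int, reserved: int = 1) -> int:
    """Longest IPv4 prefix with enough usable addresses.

    `reserved` covers addresses not handed to clients (assumed: one for
    the router). Network and broadcast addresses are excluded.
    """
    needed = clients + reserved
    for prefix in range(30, 0, -1):
        usable = 2 ** (32 - prefix) - 2
        if usable >= needed:
            return prefix
    raise ValueError("too many clients for a single IPv4 subnet")


if __name__ == "__main__":
    print(uplink_kbps(2000))             # minimum uncontended uplink, in kbps
    print(access_points_needed(2000))    # APs at 25 clients each
    print(smallest_subnet_prefix(2000))  # prefix length for one flat subnet
```

For 2000 clients this gives a 6000 kbps uplink, at least 80 APs and a /21 subnet; a /24 tops out at the 253 clients mentioned above once the router has taken an address.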

The best advice I can give you, never having built a wireless network this large myself, is to:

  • Grit your teeth and buy the best kit you can find on the market. Be prepared to pay through the nose, e.g. $1000 per AP or more.
  • Talk to the manufacturers about the maximum number of associated clients, and get assurances in writing that their kit can handle the load. Preferably get them to propose a solution for 2000 clients, also in writing.
  • Use small cells with directional antennae and lots of APs in areas where you expect more than 10 clients at peak times.
  • Try to scale your network up smoothly rather than buying a complete solution in one go. Don’t try to support 2000 clients in the first year, let alone the first day.
  • Monitor and graph the performance of the network, particularly bandwidth, wireless contention, number of errors and number of associated clients, and identify hotspots.
  • Keep one or two APs spare, and deploy them in the areas that are seeing the most activity.

Sunday Folayan wrote:

Must this network be implemented with JUST ONE wireless router? With one router … If you run 802.11bg at 2.4ghz, you have just about 2Mbps of bandwidth to play with, from one AP. If you deploy 802.11a at 5.8Ghz, you should get better than 10Mbps. If any of the clients is 802.11bg, the AP will default to 802.11bg, even if it is capable of 802.11a. With 2000 users, that is an average of 1Kbps or 5kbps at the best per subscriber! Could this be what you want?

To put it in a different way … One single AP cannot do it.

And Hervey Allen wrote:

From what I’ve experienced wireless router specifications and claims often do not match what you will experience in real-world use. I know of several large-scale installations (10,000+ users and above) who ended up using Cisco Aironet series routers with Power over Ethernet capabilities (PoE).

I will double-check, but last time I was on-site the upper limit for one of these wireless routers was around 50 concurrent users with light to moderate use. That is, a single user running a torrent can make an access point almost unusable for the other 49 potential users…

It would be interesting to hear from others on the list who have large wireless installations what their experience has been, and what hardware they have used.

Issues of giving out addresses, roaming, recapturing addresses, etc… are quite important.

Patrick Okui wrote:

Joel Ja did a pretty good presentation on what he’s learned from setting up wifi installations for the various meetings/events at NANOG27. A few things have changed in the wifi world since 2003 but the concepts are still valid.

Hamish Downer wrote in a comment to this post:

This page has some good answers. It is about tech conferences, but the basic problem of getting lots of people on wifi in a single space is covered by the solutions.

I fully agree with Hamish, the page has excellent advice from people who have actually done this, unlike me.

Finally, Mark Tinka replied:

I generally wouldn’t recommend vendors on a public mailing list in such variable matters as wireless deployments, but given the scale you’re considering, Aruba came to see me once (uninvited, as usual), and they seemed to have some rather interesting things to say re: their wireless product portfolio, with particular regard to large scale installations.

You might want to add them to your shopping list, but my guess is the price point is way-up-there, what with their controllers and all.

But be careful about “buying” everything they tell you (same goes for other vendors). As others have mentioned, binding assurances from them as well as PoC’s (proof of concept) before you sign would be great!

I hope this helps someone. Please let us know how you get on.

Agile Development and Retrospectives: Learning from failure?

Friday, December 11th, 2009

I had an interesting chat with Alan last night about the role of “retrospectives” and it reminded me about the ICT4D Twitter Chat today around “Learning from Failure” being organised by the fine folks at Inveneo.

Alan was at XPDay in London earlier this week – a 2-day hotbed of agile technology development geekery featuring a combination of traditional speaker sessions and open spaces.

Aptivate is a big advocate of using Agile methodologies. We see them as central to taking a participatory approach to international development. One thing we do after delivering a project is have a “debrief” or “retrospective” with the project stakeholders.

“Four key questions to focusing a community on learning and improvement” are described at www.retrospectives.com:

  1. What did we do well, that if we don’t discuss we might forget?
  2. What did we learn?
  3. What should we do differently next time?
  4. What still puzzles us?

Simple and sensible.

The key idea that Alan picked up on though was this: do a retrospective at the end of each iteration.

Every two weeks, every incremental release of a project, sit back, take an hour with your team and ask the above questions.

We’re certainly going to start doing this and from what was said at the XPDay – if you’re going to do one thing to improve your development process, do this.

Finally – ensure that all participants adhere to the “Retrospective Prime Directive”:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

Happy twitter chatting!

Translations, PDAs and Field Research

Wednesday, December 2nd, 2009

Translation can be a real headache.

PDA used for interviews in Tanzania


Identifying text for translation, finding individual strings and phrases to avoid duplications, contextual exceptions, keeping track of them, revisions, collaborating remotely, reviewing, back-translating,  integrating translations back into a finished product – you name it, the translation workflow has got it.

But first, a bit of background:

Aptivate started working with Camfed about 2 years ago when they were planning a major baseline study of their work supporting women’s education and empowerment in Africa. As part of their broader monitoring and evaluation work they wanted to understand the impact of their programme on areas such as attitudes towards girls’ education, awareness of HIV and sexual health issues and the effectiveness of community structures.

We trained young women from rural areas in Tanzania, Zambia and Zimbabwe to use PDAs for face-to-face interviews with teachers, students, parents and officials in the education system.

We used Palm Tungsten E2 PDAs and Solar Bags from Voltaic to run the exercise. We customised and bug-fixed a version of the excellent Episurveyor for use in the education context (it was designed as a health tool).

The surveys that Camfed created for the study (50-60 questions for 6 different stakeholders, many common questions) were designed in English and had to be available in the following languages:

  • Swahili
  • Shona
  • Ndebele
  • Bemba
  • Lozi

The questionnaires were created in a spreadsheet – one sheet per stakeholder (e.g. parent, teacher) with a list of questions and optional responses on each sheet. We put together a tool in Excel to help with the translation process. Essentially it:

  1. Went around sheets indexing each cell with relevant text
  2. Built a single list of strings in a new sheet
  3. Presented only unique strings to a translator and locked the rest down
  4. Rebuilt the original surveys in the new language once the translation was completed
  5. Could repeat all of the above to allow for back-translations too
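
The workflow in the steps above can be sketched in a few lines of Python. This is not the Excel tool itself: the sheet contents are invented sample data, plain dictionaries stand in for the spreadsheet, and the "[sw]" prefix is a placeholder for a translator's real work.

```python
def collect_unique_strings(sheets: dict) -> list:
    """Steps 1-3: walk every sheet and list each non-empty string once."""
    seen = {}
    for cells in sheets.values():
        for text in cells:
            if text and text not in seen:
                seen[text] = None  # dict keeps first-seen order
    return list(seen)


def rebuild(sheets: dict, translations: dict) -> dict:
    """Step 4: rebuild every sheet, leaving untranslated strings as they are."""
    return {name: [translations.get(text, text) for text in cells]
            for name, cells in sheets.items()}


def invert(translations: dict) -> dict:
    """Step 5: swap source and target to drive a back-translation pass."""
    return {target: source for source, target in translations.items()}


# Invented sample questionnaires, one list of cells per stakeholder sheet.
SHEETS = {
    "parent": ["What is your name?", "Yes", "No"],
    "teacher": ["What is your name?", "How many pupils?", "Yes", "No"],
}

if __name__ == "__main__":
    unique = collect_unique_strings(SHEETS)
    # Placeholder translation: a real translator would fill this in.
    translations = {text: f"[sw] {text}" for text in unique}
    translated = rebuild(SHEETS, translations)
    print(unique)
    print(translated["teacher"])
    print(rebuild(translated, invert(translations)) == SHEETS)
```

Here seven cells across two sheets collapse to four strings for the translator, which is where the saving comes from when six questionnaires share many questions.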

This is a good example of the agile approach – do the simplest thing you can to get the job done well (and on a deadline!). The translations got done, we scripted the automatic translation of the EpiSurveyor survey files (which are XML objects) and that was, as they say, that.

Until I had a chat with Camfed yesterday and they asked – “you know that translation tool you made, can we use it for some other things we’re doing?”

Fantastic!

It’s great to have built a tool that starts to get useful beyond its original remit. The Excel tool we made isn’t suitable for general use yet and after using it for 2 years, there is plenty of scope for improvement around issues of collaboration and revision management.

Enter the internet.

I posted a question to MetaFilter yesterday on this subject and I got some really interesting responses I thought I’d share in case anybody is thinking of doing this kind of thing.

In particular, check out:

Happy translating!

Low Bandwidth Web: Opera Turbo

Tuesday, June 9th, 2009

Aptivate (then Aidworld) was founded in 2003 by a group of techies and aidworkers wrestling with the question: how can you make the web usable for relief workers in the field?

Opera Turbo in Action


The problem then was access to bandwidth and the cost of that access.

Typical satellite phone connection speeds were 9.6Kbps (think of cold treacle flowing uphill or the state of dial-up in the early 90s) and the cost would be anywhere from $2 to $20 per minute.

5 minutes to download something like cnn.com made it unusable and $100 for the privilege made it unaffordable.

We came up with loband – a free online service that simplifies web pages. It downloads them remotely, trims them down and  returns them to the user in a lightweight format. It can offer a 5-10x reduction in bandwidth used.
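
The arithmetic behind those figures can be checked with a short Python sketch. The 360,000-byte page size is an assumption, chosen because it is what 5 minutes at 9.6 Kbps works out to; the function names are made up for illustration.

```python
SAT_PHONE_BITS_PER_S = 9600   # 9.6 Kbps, as quoted above
ASSUMED_PAGE_BYTES = 360_000  # assumption: implied by 5 minutes at 9.6 Kbps


def download_minutes(page_bytes: int,
                     link_bits_per_s: int = SAT_PHONE_BITS_PER_S) -> float:
    """Minutes to pull page_bytes over the link, ignoring latency and overhead."""
    return page_bytes * 8 / link_bits_per_s / 60


def cost_usd(minutes: float, usd_per_minute: float) -> float:
    """Connection cost at a flat per-minute rate."""
    return minutes * usd_per_minute


if __name__ == "__main__":
    for reduction in (1, 5, 10):  # 1 = no loband, then the 5-10x range
        minutes = download_minutes(ASSUMED_PAGE_BYTES // reduction)
        print(reduction, minutes, cost_usd(minutes, 2), cost_usd(minutes, 20))
```

Under that assumption the unreduced page costs between $10 and $100 to load, and a 10x reduction brings it down to half a minute and $1 to $10.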

Fast forward 6 years and it’s interesting how similar the story is.

Some of us now have  fast desktop & mobile web connectivity, but websites have gotten heavier (the first page I get to on Facebook is 1.25MB…)  and we don’t always have access to our quick connections.

The fundamental issue is still there: the web can be slow and expensive if you’re not on a fast “unlimited data” connection.

Opera have been doing great things with their mobile browser for some time. They recently introduced the Opera Turbo feature into their desktop edition. The concept is similar to loband but it’s designed to integrate transparently into the browser.

Opera route all relevant traffic via their servers and return a compressed stream of data to the browser containing the content you want. From the picture above, you can see that they compress graphics to save bandwidth.

One thing I suspect they do (although I haven’t checked) is reduce the overall number of requests between the browser and the server. Going back to Facebook – it takes 92 HTTP requests to build my home page. That becomes painful if you’re on a low bandwidth, high latency connection. You effectively incur an overhead for each of those 92 requests.
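To get a feel for that per-request overhead, here is a rough Python sketch. Only the figure of 92 requests comes from the paragraph above. The 0.6 second round trip and the 6 parallel connections are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
import math


def request_overhead_seconds(num_requests, round_trip_seconds, parallel_connections=1):
    """Rough lower bound on time spent waiting on round trips alone.

    Assumes one round trip per request and that the browser keeps
    parallel_connections requests in flight at once. Ignores DNS lookups,
    connection setup and the transfer time of the data itself.
    """
    rounds = math.ceil(num_requests / parallel_connections)
    return rounds * round_trip_seconds


# 92 requests is the Facebook figure above; the round-trip time and the
# number of parallel connections are assumed values.
print(request_overhead_seconds(92, 0.6, parallel_connections=6))
print(request_overhead_seconds(92, 0.6))
```

Even under these generous assumptions the waiting time grows with the number of requests, not the number of bytes, which is why cutting requests can matter as much as compressing content.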

If Opera can turn that into fewer, overall smaller requests – the Norwegians rock even more than I think they already do.

Backup Mail Exchangers

Wednesday, January 28th, 2009

On Monday night, at around 2200 GMT, the power supply unit (PSU) in the server that hosts our mail server failed. We don’t have physical access to the server out of hours, so I wasn’t able to replace it until about 1045 the next day, which left our main email server down for nearly 13 hours.

We didn’t have a backup MX because:

  • It usually can’t check whether recipients are valid or not, and therefore must accept mail that it can’t deliver;
  • It usually doesn’t have as good antispam checks as the primary, because it’s a hassle to keep it updated;
  • Spammers usually abuse backup MXes to send more spam, including Joe Jobs.

I thought that this was OK because people who send us mail also have mail servers with queues, which should hold the mail until our server comes back up. It’s normal for mail servers to go down sometimes and this should not cause mail to be lost or returned.

However, we had a report that one of our users did not receive a mail addressed to them, and was told by the sender that it had bounced. I saw the bounce message and suspected Exchange, so I decided to check how long Exchange holds messages before bouncing them. Turns out it’s only five hours by default. Most mail servers hold mail for far longer, for example five days, sending a warning message back to the sender after one day.
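The comparison between the outage and the two queue lifetimes can be checked with a few lines of Python. The calendar dates are inferred from the date of this post and are an assumption; the times and the lifetimes are the ones quoted above. The sketch takes the worst case, a message sent just as the outage began, and the helper name is made up.

```python
from datetime import datetime, timedelta

# Dates are inferred from the post date (assumption); times are the ones above.
outage_start = datetime(2009, 1, 26, 22, 0)
outage_end = datetime(2009, 1, 27, 10, 45)
outage = outage_end - outage_start


def survives_outage(queue_lifetime, outage_length):
    """True if a sender that keeps retrying for queue_lifetime outlasts the outage."""
    return queue_lifetime > outage_length


print("outage lasted", outage)
print("five-hour queue survives:", survives_outage(timedelta(hours=5), outage))
print("five-day queue survives:", survives_outage(timedelta(days=5), outage))
```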

Bounced messages reflect badly on us. Apart from making our main mail server more reliable :) we need a backup MX to accept mail when the master is down.

However, I do still want to minimise the spam problem that this will cause. Therefore I configured our backup MX to accept mail only when the master is down. Otherwise it defers the message, which tells the sender to try sending it to the master (again).

How did I achieve this magic? With a little Exim configuration that took me a day and that I’m quite proud of. I set up a new virtual machine which just has Exim on it, nothing else. I configured it as an Internet host, and to relay for our most important domains. Then I created /etc/exim4/exim4.conf.localmacros with the following contents:

CHECK_RCPT_LOCAL_ACL_FILE=/etc/exim4/exim4.acl.conf
callout_positive_expire = 5m

This allows us to create a file called /etc/exim4/exim4.acl.conf which contains additional ACL (access control list) conditions. The other change, callout_positive_expire, I’ll describe in a minute.

I created /etc/exim4/exim4.acl.conf with the following contents:

# if we know that the primary MX rejects this address, we should too
deny
        ! verify = recipient/callout=30s,defer_ok
        message = Rejected by primary MX

# detect whether the callout is failing, without causing it to
# defer the message. only a warn verb can do this.
warn
        set acl_m_callout_deferred = true
        verify = recipient/callout=30s
        set acl_m_callout_deferred = false

# if the callout did not fail, and the primary mail server is not
# refusing mail for this address, then it's accepting it, so tell
# our client to try again later
defer
        ! condition = $acl_m_callout_deferred
        message = The primary MX is working, please use it

# callout is failing, main server must be failing,
# accept everything
accept
        message = Accepting mail on behalf of primary MX

The first clause, which has a deny verb, does a callout to the recipient. A callout is an Exim feature which makes a test SMTP connection and starts the process of sending a mail, checking that the recipient would be accepted. This is designed to catch and block emails that the main server would reject. Our backup server has no idea what addresses are valid in our domains; only the primary knows that.

The callout response is cached for the default two hours if it returns a negative result (the recipient does not exist on the master) or five minutes (see callout_positive_expire above) if the address does exist. We use the defer_ok option here so that if we fail to contact the master, we don’t defer the mail immediately, but instead assume that the address is OK and therefore continue to the next clause.

The second clause of the ACL, which has a warn verb, is what took me so long to work out. Normally, if a condition in a statement returns a result of defer, which means that it could not be completed, the server will defer the whole message (tell the sender to come back later). In almost all cases this is the right thing to do, but it’s the exact opposite of what we want here. We want to accept mail if the callout is failing, not defer it, otherwise our backup MX is useless (it stops accepting mail if the primary goes down).

Because this is such an unusual thing to do, there is no configurable option for it in Exim. The only workaround that I found is that there is exactly one way to avoid a deferring condition causing the message to be deferred: a warn verb. The documentation for the warn verb says:

If any condition on a warn statement cannot be completed (that is, there is some sort of defer), the log line specified by log_message is not written… After a defer, no further conditions or modifiers in the warn statement are processed. The incident is logged, and the ACL continues to be processed, from the next statement onwards.

So what we do is:

  1. Set the local variable acl_m_callout_deferred to true;
  2. Try the callout. If it defers (cannot contact the primary server) then we stop processing the rest of the conditions in the warn statement, as described above;
  3. If we get to this point, we know that the callout did not defer, so we set acl_m_callout_deferred to false.

The third clause of the ACL, which has a defer verb, simply checks the variable that we set above. If we get this far then the primary server is not rejecting this address; and if it’s not deferring either, then it must be accepting mail for the address. In that case, we defer the message, telling our SMTP client to try again later, at which point it will hopefully succeed in delivering directly to the primary.

Callout result caching becomes a problem here. If a previous callout had verified that a particular address existed, and that result was cached for the default 24 hours, then the backup MX would keep deferring mail to that address for up to 24 hours, even if the master went down in the meantime. This is why we reduced the positive callout result caching time to 5 minutes earlier.
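The effect of the cache lifetime can be sketched in Python. This is a simplified model of the behaviour described above, not of Exim’s actual cache code; the function name and the two-minute scenario are hypothetical.

```python
from datetime import timedelta


def stale_positive_window(cache_age_at_outage, positive_expire):
    """How long after the primary goes down a cached "address exists" result
    keeps being trusted, during which the backup would wrongly defer mail."""
    remaining = positive_expire - cache_age_at_outage
    return max(remaining, timedelta(0))


# Hypothetical scenario: the address was last verified 2 minutes before the outage.
age = timedelta(minutes=2)
print("24h expiry:", stale_positive_window(age, timedelta(hours=24)))
print("5m expiry: ", stale_positive_window(age, timedelta(minutes=5)))
```

With a five-minute expiry the backup can be wrong for at most five minutes after the primary fails, at the price of more frequent callouts while the primary is up.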

The fourth clause of the ACL, which has an accept verb, is even simpler. It accepts everything that was not denied or deferred earlier. We can only get this far if the master is neither accepting nor rejecting mail for that address, which means that the callout could not reach it.
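Taken together, the four clauses amount to a small decision table. The Python below is only a model of that logic for a single recipient, useful for checking the reasoning; it is not Exim code, and the names Callout and backup_mx_decision are made up for illustration. It also ignores callout caching.

```python
from enum import Enum


class Callout(Enum):
    ACCEPTED = "primary accepts the recipient"
    REJECTED = "primary rejects the recipient"
    DEFERRED = "primary could not be reached"


def backup_mx_decision(callout):
    """Mirror the four ACL clauses for a single recipient."""
    # Clause 1 (deny): the primary rejects this address, so we do too.
    # With defer_ok, a failed callout counts as a pass here.
    if callout is Callout.REJECTED:
        return "deny"

    # Clause 2 (warn): remember whether the callout itself deferred.
    callout_deferred = callout is Callout.DEFERRED

    # Clause 3 (defer): the primary is up and accepting, so send the client there.
    if not callout_deferred:
        return "defer"

    # Clause 4 (accept): the primary is unreachable, so take the mail.
    return "accept"


for outcome in Callout:
    print(f"{outcome.value}: {backup_mx_decision(outcome)}")
```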

So far the configuration appears to work fine and has blocked 14 spam attempts (abusing the backup MX) in 14 hours.