Archive for the ‘Open Source’ Category

Content indexing in Django using Apache Tika

Wednesday, February 1st, 2012

For the Documents module of our new open-source Generic Intranet, we need to be able to extract the text content and metadata from various kinds of documents:

  • PDF files
  • Microsoft Office DOC, XLS and PPT files
  • and the new XML equivalents, DOCX, XLSX and PPTX.

I found various tools online to help extract this text, largely thanks to Stack Overflow here and here. This ended up with a hodgepodge of tools:

There were a number of problems with this hodgepodge:

  • I was unable to find any Python or command-line solution for old Excel (XLS) files;
  • These solutions did not extract metadata, only document text;
  • The choice of which tool to use depends on the MIME type returned by the file(1) command, which varies depending on the OS (Debian/Ubuntu or CentOS) and which version of the library is installed.
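
The MIME type problem in the last point can be sketched as a lookup table. This is only an illustration: the extractor names are hypothetical placeholders, and the generic types listed as ambiguous are examples of answers that some versions of file(1) are assumed to give for Office documents.

```python
# Hypothetical mapping from MIME types to extractor names. The names are
# placeholders for illustration, not the tools used in the original setup.
EXTRACTORS = {
    "application/pdf": "pdf-extractor",
    "application/msword": "doc-extractor",
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": "docx-extractor",
}

# Assumption: generic answers that do not identify the real format, so no
# single tool can be chosen from them.
AMBIGUOUS = {
    "application/vnd.ms-office",   # could be DOC, XLS or PPT
    "application/zip",             # DOCX, XLSX and PPTX are ZIP containers
    "application/octet-stream",
}


def choose_extractor(mime_type):
    """Return an extractor name, or None if the type is unknown or ambiguous."""
    # Strip parameters such as "; charset=binary" and normalise the case.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    if base_type in AMBIGUOUS:
        return None
    return EXTRACTORS.get(base_type)
```

Every new platform or library version means revisiting both tables, which is the maintenance burden that format auto-detection avoids.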

Another Stack Overflow post recommended Apache Tika for metadata extraction. It appears to support all the document formats that we need, and to have auto-detection of the document format, which solves all the MIME type problems as well. However, it introduces a new problem: it’s written in Java, which is hard to access from Python.

Luckily I found some instructions for building a Python wrapper around Tika, using some tools that I’d never heard of, and this seemed like a good approach. Unfortunately the installation process is very non-standard, which would not fit in with our fabric-based automated deployment process, and would make it harder for users to install the Intranet themselves.

The instructions are somewhat outdated at the time of writing, as they refer to Tika version 0.7, while 1.0 has been released. I was unable to register for an account to update that page, so I wrote to the author with the details that I discovered, and will also document here that the following command works for me:

python ../jcc/jcc/__main__.py \
        --include /usr/share/java/org.eclipse.osgi.jar \
        --jar tika-parsers-1.0.jar \
        --jar tika-core-1.0.jar \
        java.io.File java.io.FileInputStream \
        java.io.StringBufferInputStream \
        --package org.xml.sax \
        --include tika-app-1.0.jar \
        --python tika --version 1.0 --reserved asm

I was able to go further than this, and package Tika in a way that makes it easy to install with Pip, and thus integrate with our deployment process.

The wrapper is written using JCC, which works by generating and compiling C++ code that links to the Java classes, and then a Python wrapper around that C++. This means that it needs to be recompiled for each platform, so I couldn’t just distribute a binary blob with the Intranet (I had the same problem with DocToText above).

The version of setuptools on our servers doesn’t support JCC’s shared library mode. JCC dies with an error unless shared mode is explicitly disabled or the setuptools patch has been applied. I couldn’t do either of these as part of our standard deployment process, so I patched JCC to disable shared mode, since we don’t need it anyway. I also added some patches to allow various setup.py commands used by pip to be forwarded through JCC to the setup function call.

This seems to be enough to allow you to install JCC like this:

pip install git+git://github.com/aptivate/jcc.git

I also wrote a setup.py file that handles pip’s command line invocations and passes the necessary options to JCC, and JCC’s invocation of the setup function. This seems to be enough to install the package using pip:

pip install git+git://github.com/aptivate/python-tika.git

and you can use the last parameter as a package specification in pip_packages.txt, or whatever you pass to pip -r.

You can find the pip-installable Tika package, complete with Tika 1.0 JAR files, in our python-tika repository on GitHub. This will save you the work of downloading and compiling Tika and all of its dependencies. I have started a discussion with the JCC developers about merging these changes into the upstream project.

Rough Guide to rural data collection with ODK

Monday, December 5th, 2011

This post has three purposes, which I think overlap sufficiently to combine them:

  • A User Guide for the system that we developed for UNICEF, IDS and RuralNet Zambia
  • A Developers’ Guide for anyone wishing to build something similar
  • Notes on lessons learned that may assist future implementers

Project goals

Automate the data entry part of a long paper-based survey, by replacing the paper forms with electronic devices.

Hardware and application selection

The survey has several long and complex questions, and long sets of multiple-choice answers. The data collection needs to be done in dusty rural Zambia, and the devices might need to be used for a full day without power. Collected data should be sent wirelessly to a secure data repository at some time after collection.

Text entry is required for many fields. That means either a real keyboard with keys, or a sufficiently large touch screen to type comfortably on. Use of the device camera, and presentation of reports and graphs on the same device, might be required in future.

Two possible hardware platforms were identified:

  • Tablet laptops with touch screens
  • Tablet mobile devices (iPad or Android tablet)

We selected the latter for this project due to lower cost, lighter weight, better usability and longer battery life.

The available software options that we identified were:

  • EpiSurveyor (Java J2ME, partly closed source, we have used before and fixed bugs)
  • OpenXdata (Java J2ME, open source, developed and supported by an Aptivate alumnus among others)
  • Open Data Kit (ODK) (Android, open source, active community)
  • Bespoke online/offline survey in HTML5

Of these, we eliminated EpiSurveyor and OpenXdata due to lack of compatibility with the hardware platform(s) we had chosen.

We chose ODK over a bespoke system due to limited time available for development, and ability to easily take photos and record GPS coordinates using the device’s hardware.

Of the available Android tablet devices, we chose the Samsung Galaxy Tab for the pilot project, due to its high quality construction. For future projects we would probably use a lower cost device; see the lessons learned for details.

Form creation

Since the survey is quite long (about 230 questions) we wanted an easy way to enter the questions. The ODK application requires the form to be in XForms format. We identified the following tools for creating XForms:

We decided to use XLS2XForm, which enabled us to enter the large number of questions easily in Excel. The others all have graphical builders, which have advantages and disadvantages for less technical users:

  • More visually appealing
  • All available options presented visually (types of controls, groups, etc.)
  • Less likely to make a mistake and produce an invalid form
  • Cumbersome user interface slows down data entry

Unfortunately, none of these designers were able to import an existing form in XForms format, which means that the modifiable “source code” of the form must be maintained in a “proprietary” format in each case, and it’s difficult to switch between tools.

You can download the conversion tools, and the Excel spreadsheet with the completed questionnaire as we delivered it to RuralNet, here. RuralNet staff, please use the latest version of the spreadsheet that you can find locally. To use the tools, you will need to download and install Python 2.7 and Java (JRE). Then download the tools as a ZIP file and extract it somewhere. I recommend that you keep the master copy of the spreadsheet in Dropbox to ensure that it’s backed up, and it’s always clear what the latest version is.

For help in building surveys using XLS2XForm, please see the documentation. In addition to the question types listed there, we have used the following shortcuts, which also work in this customised version of XLS2XForm:

  • text is short for add text prompt (a text field, such as a person’s name)
  • note is short for add note prompt (a read-only field, providing additional information for the user)
  • time is a time field without a date (for example, survey start and end times)

To compile the spreadsheet into an XForms form, run the build_and_validate.py script by double-clicking on it. If it works, it will show the message “Success!”, otherwise it will show an error message, usually caused by a mistake in the Excel spreadsheet. If it works, it will create (replace) the file called zambia-ranq-round3.xml in the same directory. If your spreadsheet has a different name, you can create a shortcut to call build_and_validate_custom.py with the name of the spreadsheet on the command line.

Software components

ODK Aggregate is the software that powers the Internet server. It is a repository for blank forms (designs) and completed forms (data). Our server is located at http://partimob.appspot.com/. This server is currently paid for by us, and will need to be transferred to RuralNet at some point.

ODK Collect is the application that runs on the device, and users interact with it to complete the survey. It’s essentially a user interface for XForms. It can download blank forms (designs) from an ODK Aggregate server, and upload completed forms (data) to the Aggregate server as well.

ODK Briefcase is the software that downloads completed forms (data) from the Aggregate server and converts them into CSV (spreadsheet) format, which can be loaded into

Customised ODK Collect

We are using a custom version of ODK Collect. You can download the source code for it here, or the compiled application here. You can also find it in the ZIP file download. If you prefer, you can use the latest official version of ODK Collect. The two are compatible, but our version adds the following useful features:

  • Use supplied login and password by default to save a round trip and a prompt.
  • Add keyboard navigation, useful for form filling on android-x86 because the mouse interface is pretty clunky.
  • Restore ability to modify completed and submitted forms on the device, which was removed from the official version in 1.1.7.
  • Improved error messages and progress indication during form uploads.
  • Allow setting the instance name on the first page of the survey.
  • Allow saving incomplete surveys on required questions (in case a survey is interrupted; almost all of our questions are required).

There are several ways to install ODK Collect on a device:

  • Download it from the Android Market (official version only, not our customised version)
  • Copy the APK file onto a microSD card, insert the card into the device, and use the My Files application to find and open it from the SD card.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the APK file onto the device’s internal memory, then use the My Files application to find and open it.
  • Attach the USB cable from the device to a computer, and use ADB‘s install command to install the APK file.

It’s useful to put the application onto the device’s desktop. To do that, open the Applications list, find ODK Collect, and press and hold it with your finger for a few seconds. The background will change to the desktop; release your finger to drop the application there.

It’s also useful to remove all the other junk from the desktop. For each icon and widget on the desktop, press and hold it with your finger for a few seconds, until the trashcan icon appears, then drag your finger to the trashcan and release it there.

Form management on the device

There are several ways to put blank forms (designs) onto the tablets:

  • Download them from the ODK Aggregate server using ODK Collect.
  • Copy them onto a microSD card, insert the card into the device, and use the My Files application to copy them from the SD card to the /sdcard/odk/forms directory.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the form into the /sdcard/odk/forms directory.
  • Attach the USB cable from the device to a computer, and use ADB or DDMS to push the file onto the device, into the /sdcard/odk/forms directory.

Of these methods, ADB or DDMS is recommended for rapid development, and using the Aggregate server is recommended for production use, since the form must be installed on the Aggregate server for it to be able to accept submissions.

Similarly there are several ways to copy completed forms (data) off the device:

  • Upload them to the ODK Aggregate server using ODK Collect.
  • Use the My Files application to copy them from /sdcard/odk/instances to a microSD card, then remove the card and connect it to the computer, and drop the files into the ODK Briefcase data directory.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the files from the /sdcard/odk/instances directory to the ODK Briefcase data directory.
  • Attach the USB cable from the device to a computer, and use ADB or DDMS to pull the file from the device’s /sdcard/odk/instances directory to the ODK Briefcase data directory.

Of these methods, using ODK Aggregate is recommended for development and production use.

Since the Aggregate server is on the Internet, this method requires that the device have Internet access. So it either needs a valid SIM card installed with credit and a data bundle, or a WiFi network connected. We had many problems with using SIM cards for data, so WiFi is preferred if possible.

The directories mentioned above will not exist until ODK Collect is installed on the device and run for the first time. Forms downloaded from the Aggregate server will also be placed in the /sdcard/odk/forms directory. Forms completed on the device will be placed in the /sdcard/odk/instances directory.
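
The ADB-based steps above could be scripted roughly as follows. This is a sketch, assuming that the adb tool is on the PATH and that a single device is attached; the function names and the example file names are hypothetical, while the device directories are the ones described above.

```python
import subprocess

# Device directories used by ODK Collect, as described above.
FORMS_DIR = "/sdcard/odk/forms"
INSTANCES_DIR = "/sdcard/odk/instances"


def install_command(apk_path):
    """Command to install the ODK Collect APK on the attached device."""
    return ["adb", "install", apk_path]


def push_form_command(form_path):
    """Command to copy a blank form (design) into the forms directory."""
    return ["adb", "push", form_path, FORMS_DIR + "/"]


def pull_instances_command(local_dir):
    """Command to copy all completed forms (data) to a local directory."""
    return ["adb", "pull", INSTANCES_DIR, local_dir]


def run(command):
    """Run one of the commands above, raising an error if adb fails."""
    subprocess.check_call(command)
```

For example, run(push_form_command("zambia-ranq-round3.xml")) would copy the compiled form onto the device. Remember that the directories only exist after ODK Collect has been run once.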

Configuring ODK Collect

Collect needs to know the details of the ODK Aggregate server to log into it, download blank forms and upload completed forms.

Open the ODK Collect application, press the Settings button and click on Change Settings. Click on URL and enter https://partimob.appspot.com. Similarly, complete the Username and Password using the details that you’ve been given by the Aggregate server operator, or the account that you’ve created on the Aggregate server. This account should only have Data Collector permissions, no more. Press the Back key to get back to the main menu of ODK Collect.

Downloading forms using ODK Collect

Open ODK Collect on the device, and click on the Get Blank Form button. Collect will try to log into the Aggregate server using the details that you’ve provided, and get a list of forms on the server that have the Downloadable box ticked. This is on by default for newly uploaded forms.

Tick the box next to all the forms that you want to download, and click on the Get Selected button.

Filling forms on the device

Open ODK Collect on the device, and click on the Fill Blank Form button. All the forms in the device’s /sdcard/odk/forms directory should be listed. Choose the form that you want to complete.

You will see an introductory screen showing how to move between questions by swiping your finger across the screen, from right to left or left to right. This screen has a text box at the bottom, which you can use to name the form that you’re completing. Naming forms is useful if your data collection is interrupted and you need to resume it later. It’s much easier to identify the form using its name, rather than opening it and flicking through to find some identifying information. You might name the form based on the household code that you’re surveying.

Depending on your answers to some questions, others may be hidden, or their text might change.

At the end of the form there is another chance to Name this form, and a tickbox to Mark form as finalized. Before you can upload the form to the Aggregate server, this box must be ticked, and you must press the Save Form and Exit button. Otherwise Collect will consider that the form is incomplete.

Sending completed forms to Aggregate

Open ODK Collect on the device, and click on the Send Finalized Form button on the main menu. Tick the box next to all the forms that you want to upload to Aggregate, and click on Send Selected. After the upload is complete, you should see the Upload Results message. Every form should have “Success” next to it, otherwise it was not sent successfully.

Downloading forms using Briefcase

We are using a customised version of ODK Briefcase with the following changes:

  • Fix the export of repeated groups, which before only worked for the first row (issue 461).
  • Shorten exported column names, to allow the CSV file to be imported into Access.
  • Allow the server name, username and password to be provided on the command line (or via a shortcut).

You can find the source code here and the pre-compiled version here, as an executable JAR file. You can also find it in the ZIP file download. If you make changes to the source and want to build the executable JAR again, install Maven and use the mvn package command.

To download the completed forms, open Briefcase by double-clicking on the briefcase-1.0-jar-with-dependencies.jar file. On the Transfer tab, click on the Connect button. For the URL, enter https://partimob.appspot.com, and for the user name and password, give the details of an ODK Aggregate account with Data Viewer permissions.

Then you should see a list of forms appear under the heading Forms to Transfer. Tick the box next to the one that your users have been completing, and then click on the Transfer button. If you do this after all the completed forms (data) have been submitted to the ODK Aggregate server, you will not need to do it again for that form template (design).

Now switch to the Transform tab and see if the form appears in the Form list. If it doesn’t, then exit and restart the Briefcase application (issue 464).

For Output Type, choose .csv and media files. For Output Directory, choose the directory where you’d like to save the CSV files. Note that any previous files exported to that directory from the same form will be overwritten without warning, even if they have been modified (cleaned). Click on the Output button to write the CSV files.

Cleaning data in Excel

You can find the Excel spreadsheet that we use for data storage and cleaning here. Note that Excel is a long way from the best way to store and manipulate data like this. Microsoft Access would be far more appropriate. Yet again I wish there was a sufficiently powerful open source alternative.

Because the spreadsheet contains cleaned data, which is “better” than the raw data which is included in the CSV export, we don’t want to overwrite existing rows. For the main section of the questionnaire (the so-called Single Responses) you can include only the new data like this:

  • Open the main spreadsheet and switch to the Single Responses tab
  • Highlight all rows from 3 down to the bottom, and Sort them by the SubmissionDate column.
  • Note the last submission date on this spreadsheet.
  • Open the newly exported CSV file for the single responses (something like RANQ-2011-Round-4-v5.csv).
  • Sort this file by the SubmissionDate column as well.
  • Highlight and copy all the rows whose submission date is later (more recent) than the last one in the main spreadsheet.
  • Paste them at the bottom of the Single Responses tab of the main spreadsheet, below the other data.
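
The same selection could be done with a short script instead of by hand. This is a minimal sketch, not part of the delivered tools: it assumes that both sheets are available as CSV text and that SubmissionDate values look like "2011-12-05 14:30:00", which may not match the real export.

```python
import csv
import io
from datetime import datetime

# Assumption: the timestamp format. Adjust it to match the real export.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def submission_time(row, column="SubmissionDate"):
    """Parse the submission timestamp of one row (a dict)."""
    return datetime.strptime(row[column], DATE_FORMAT)


def rows_to_append(main_rows, exported_rows):
    """Return exported rows newer than the latest row in the main sheet."""
    ordered = sorted(exported_rows, key=submission_time)
    if not main_rows:
        return ordered
    last_seen = max(submission_time(row) for row in main_rows)
    return [row for row in ordered if submission_time(row) > last_seen]


def read_rows(csv_text):
    """Read CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))
```

Only rows strictly later than the last known submission are returned, so cleaned rows already in the main spreadsheet are never overwritten.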

For the other tables, this process needs to be done completely manually at present.

You can then check and clean the data by viewing and modifying it in Excel. Note that each sheet has one or two columns at the end, which are filled by formulae that look up values from the Single Responses sheet, such as the Household Code.

Using the Android x86 Emulator

To be written.

Lessons learned

To be written.

Ubuntu Laptops in Schools

Thursday, September 1st, 2011

I’m currently working on a project that’s putting computers into Zambian schools to try to revolutionise education, making it more fun and interactive for kids, and reducing the problems of teacher absence.

They’re using Intel Classmate style PCs, currently running Windows 7 Home Starter. I’m investigating whether Ubuntu would provide a better experience. It might be faster, more reliable, more manageable and easier to lock down than Windows.

Ubuntu 10.10 (Maverick) doesn’t boot on these computers, probably due to problems with the HPET. I don’t like Unity so I don’t want to try 11.04 just yet, which left me falling back to 10.04 (Lucid) with long-term support.

Automatic Logins

The computers should be in a kiosk-like mode for student use, where no login is required but they are locked down. They should also be used by teachers (with a password and fewer restrictions) and administrators (with another password and no restrictions). So I created three user accounts. Student is set to log in by default with no password.

While this works, there are other places where a password is requested and none works, because the Student account doesn’t have a valid password:

  • unlocking from screensaver
  • switching users
  • sudo from the command line

The last one is less important because students should not be able to access the command line anyway, or have any administrative rights. But they need to unlock the screensaver and be able to switch users.

We solved the screensaver problem by telling the screensaver not to lock the screen for this user, just as we did for Camfed in the Zambia SRC with LTSP:

# Disable locking the screen for users with no password to unlock it
sudo -u student gconftool-2 \
	--type boolean \
	--set /apps/gnome-screensaver/lock_enabled false

However the user switching was more tricky. Luckily I found a very helpful question and answer on SuperUser. I improved on it slightly by reusing Ubuntu’s builtin nopasswdlogin group, so that users who can log in with no password can also be switched to with no password.

To achieve this, just add the following line at the beginning of /etc/pam.d/gnome-screensaver:

auth sufficient pam_succeed_if.so user ingroup nopasswdlogin

Firefox Kiosk Mode

We want the browser to be fullscreen all the time, so we need to use some extensions:

  • Full Fullscreen to make it start in fullscreen mode;
  • Keyconfig to stop them exiting full screen mode with F11, or closing the browser with Alt-F4.

We also change some preferences using about:config:

  • xpinstall.enabled: false, to prevent installing more extensions;
  • app.update.auto: false, to stop Firefox checking for updates by itself;
  • browser.sessionstore.resume_from_crash: false, to prevent the Restore previous session prompt when starting Firefox;
  • extensions.update.enabled: false, to stop Firefox checking for updates to its installed extensions;
  • extensions.update.notifyUser: false, to avoid a prompt if an extension update is discovered;
  • browser.tabs.warnOnClose: false, to avoid the prompt to save your tabs on browser exit.
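
If there are many laptops to set up, these preferences could be written to a user.js file in the Firefox profile directory instead of being typed into about:config on each machine. That is a different technique from the one described above, and the helper below is only a sketch that generates the lines for such a file.

```python
import json

# The preferences listed above, all set to false.
KIOSK_PREFS = {
    "xpinstall.enabled": False,
    "app.update.auto": False,
    "browser.sessionstore.resume_from_crash": False,
    "extensions.update.enabled": False,
    "extensions.update.notifyUser": False,
    "browser.tabs.warnOnClose": False,
}


def user_js_lines(prefs):
    """Return one user_pref(...) line per preference, sorted by name."""
    # json.dumps quotes the name and renders Python False as JavaScript false.
    return [
        "user_pref(%s, %s);" % (json.dumps(name), json.dumps(value))
        for name, value in sorted(prefs.items())
    ]
```

Writing "\n".join(user_js_lines(KIOSK_PREFS)) to user.js in the student profile would apply the settings each time Firefox starts.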

Window Manager

We want the students to have access to a restricted set of applications. The user interface also needs to be unbreakable (child-proof). Windows should always be maximised, as the laptops have quite small screens. All of this points to using a custom window manager/desktop instead of the standard Gnome or KDE.

Fluxbox and Openbox were recommended, but they seem to be aimed at highly-customised desktop environments (for geeks) rather than locked-down kiosks or embedded systems. Matchbox looks like quite a good fit. It has a very simple front menu and an everything-maximised window manager, which sounds great for ease of use.

We’re using GDM for the user login, which offers users a choice of which session (window manager) to run. This is OK, and even quite good for administrators, as it provides a failsafe option in case the usual window manager is borked. But I can’t see how to disable or override this for particular users. Students have no-password logins, so they don’t even get the opportunity to choose a window manager.

The DefaultSession in /etc/gdm/custom.conf (chosen using gdmsetup) changes their window manager, but affects all users, and we don’t want to force everyone to use the restrictive kiosk window manager.

I found that GDM lets you specify your own Xsession script, which gdm uses to actually start the session selected by the user. So I wrote a replacement:

#!/bin/sh

if [ "$USER" = "student" ]; then
	/etc/gdm/Xsession /usr/bin/matchbox-session
else
	/etc/gdm/Xsession "$@"
fi

All it does is call the original Xsession, overriding the name of the session manager if the current user is the special student user, and otherwise behaves exactly as normal.

Save it in /usr/local/bin/GdmKioskSession, make it executable, and add the following line to /etc/gdm/custom.conf:

BaseXSession=/usr/local/bin/GdmKioskSession

If you don’t even want the application menu, but want to force a particular application such as a web browser (true kiosk mode), replace /usr/bin/matchbox-session with /usr/local/bin/kiosk-session, create that file with the following contents and make it executable:

#!/bin/sh
matchbox-window-manager -use_titlebar no &
exec /usr/bin/chromium-browser -kiosk -app=http://staging.ischool.zm/

More lockdown tips to follow.

Traffic shaping with PF, ALTQ and HFSC

Friday, August 5th, 2011

We usually use Linux firewalls for traffic shaping, because the power of the traffic control (tc) system exceeds FreeBSD’s dummynet in most ways.

Dummynet can be used to create arbitrary delays and packet loss, which is very useful for simulating poor connections, but not for sharing bandwidth and prioritising packets between different traffic classes on a real traffic shaper.

However, I’ve just been testing PF (the new standard packet filter) and ALTQ (the alternative queueing system) on FreeBSD, and I’m impressed by the capabilities. I prefer this combination (PF+ALTQ) over Linux TC because:

  • PF and ALTQ are fully integrated and configured using the same file, whereas TC has its own (very hard to use) classifier. I normally use the iptables CLASSIFY target to classify traffic instead, but this is not integrated.
  • TC is very hard to use generally. The authors seem more concerned with functionality than usability.
  • ALTQ has named queues which helps usability enormously compared to TC’s hex numbered classes.
  • ALTQ gives very low delay when the interface is not 100% saturated, which seems impossible to achieve with TC.

It does annoy me that ALTQ is not enabled in the default kernel, so you have to compile your own kernel. I used the following commands to copy the default GENERIC configuration to a custom one, which I called ALTQ:

cd /boot
cp -Rp kernel GENERIC # back up the current kernel (/boot/kernel is a directory)
cd /usr/src/sys/i386/conf
cp GENERIC ~/ALTQ
ln -s ~/ALTQ .
vi ALTQ

and added the following lines to the new kernel configuration file, ALTQ:

options ALTQ
options ALTQ_RED
options ALTQ_RIO
options ALTQ_HFSC
options ALTQ_PRIQ

and then compiled and installed the new kernel:

cd /usr/src
make buildkernel KERNCONF=ALTQ
make installkernel KERNCONF=ALTQ

and then rebooted to load the new kernel. After that, we need to create a PF configuration. Some example configurations use CBQ queues, but I prefer HFSC because:

  • HFSC is guaranteed accurate, whereas CBQ is approximate
  • CBQ requires you to guess the average packet size and its accuracy depends entirely on this
  • HFSC has service curves which allow you to deliver small files quickly and drop the priority of large connections (e.g. file downloads) with great ease.

Here is a sample configuration of PF+ALTQ+HFSC that I used for testing on a transparent bridging firewall (bridge0 connecting em0 and em1):

altq on em1 hfsc bandwidth 1Mb queue { ftp, ssh, icmp, other }
queue ftp bandwidth 30% priority 0 hfsc (upperlimit 99%)
queue ssh bandwidth 30% priority 2 hfsc (upperlimit 99%)
queue icmp bandwidth 10% priority 2 hfsc (upperlimit 99%)
queue other bandwidth 30% priority 1 hfsc (default upperlimit 99%)
pass out quick on bridge0 inet proto tcp from any port 21 to any queue ftp
pass out quick on bridge0 inet proto tcp from any port 22 to any queue ssh
pass out quick on bridge0 inet proto icmp from any to any queue icmp
pass out quick on bridge0 all

We are only queueing on em1 here, which is the downstream, so we are only limiting downloads. We deliberately limit them to 1 Mbps for testing. The limit should always be lower than your actual download bandwidth, to ensure that the queue is on the FreeBSD firewall and not any other device.

We create four named queues under the root, which is implicitly named root_em1. We reserve 30% of bandwidth each for FTP, SSH and other traffic, and 10% for ICMP. However, any class can exceed its reserved bandwidth, up to its upperlimit. This defaults to 100%, which means that one class can potentially cause delays to traffic in other classes, so we override it to 99%.
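
To make the percentages concrete, here is the arithmetic for the sample configuration as a small sketch. It assumes that "1Mb" means 1000 kbit/s; the variable and function names are only for illustration.

```python
# Assumption: "bandwidth 1Mb" in the altq line is 1000 kbit/s.
LINK_KBPS = 1000

# Reserved share of the link for each queue, in percent.
QUEUE_SHARES = {"ftp": 30, "ssh": 30, "icmp": 10, "other": 30}

# Ceiling that any single queue may borrow up to, in percent.
UPPERLIMIT = 99


def kbps(percent, link_kbps=LINK_KBPS):
    """Convert a percentage of the link into kbit/s."""
    return link_kbps * percent // 100


# Guaranteed rate for each queue when every queue is busy.
reserved = {name: kbps(share) for name, share in QUEUE_SHARES.items()}

# The most that one queue can use when the others are idle.
ceiling = kbps(UPPERLIMIT)

# Bandwidth that always stays available to the other queues.
headroom = LINK_KBPS - ceiling
```

With these figures FTP, SSH and other traffic are each guaranteed 300 kbit/s and ICMP 100 kbit/s, while no single queue can take more than 990 kbit/s.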

Note that even though we create the queues on the em1 device, we must filter packets on bridge0, as otherwise our traffic does not match our pf rules.

Update: I found some more information about traffic shaping and advanced usage of HFSC, including realtime guaranteed classes for VoIP.

Update 2: For a simpler setup with ALTQ, try this guide.

8 bit alpha png optimisation with pngquant

Tuesday, June 28th, 2011

It looks like we may have found a great alternative for PNG optimisation at last: pngquant. We really struggled with this issue on the Reaction Scorecards website, as the homepage had some graphics with fine lines and alpha transparency which were proving difficult to optimise.

In the past I have used Adobe Fireworks, which seemed unrivalled in the open source world, but pngquant is just great.

The command line tool allows you to squish the 32-bit PNG files output from Inkscape into 8-bit files with any number of colours from 2 to 255. For the test I used an 11.9 kB file exported from Inkscape, which was perfectly acceptable when processed using pngquant with 24 colours, and now had a file size of only 3.1 kB. Yippee! I’m a happy bunny.

happy transparent 8bit bunny

(MyPaint bunny 39 kB -> Gimp bunny 53 kB -> happy pngquant bunny 13 kB)
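
As a rough sketch of the test above, the following helpers build a pngquant command line and work out the saving. The file name in the usage note is hypothetical, and the argument order (number of colours, then the file) is an assumption that may need adjusting for your version of pngquant.

```python
import subprocess


def pngquant_command(png_path, colours):
    """Build a command that reduces png_path to the given number of colours."""
    # Assumption: pngquant takes the colour count before the file name.
    return ["pngquant", str(colours), png_path]


def percent_saved(before_kb, after_kb):
    """Return the reduction in file size as a percentage, to one decimal."""
    return round(100.0 * (before_kb - after_kb) / before_kb, 1)


def optimise(png_path, colours):
    """Run pngquant on one file, raising an error if it fails."""
    subprocess.check_call(pngquant_command(png_path, colours))
```

For example, optimise("homepage-graphic.png", 24) would run the 24-colour conversion, and percent_saved(11.9, 3.1) shows that the test file shrank by about 74%.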

AfNOG 2011, Part 2

Monday, May 30th, 2011
People sitting at computers in a lecture

Boot Camp

AfNOG boot camp was absolutely massive this year. I think they had 75 people when they were only expecting 40. They took over half our classroom as well, which made setup tricky as we had to work around people and ask them to move repeatedly, and we couldn’t get all of our tables in to cable them up.

It was followed by the obligatory welcome dinner, at the White Sands’ outdoor restaurant, with the requisite number of speeches and rounds of applause.

Today we had the first day of Scalable Services. Desktop installation hadn’t gone too well. My attempt to respin with fixes, wiping the unused space after the imaged partition, failed badly and resulted in a corrupted image, so we had to reimage those boxes.

People sitting around dinner tables in front of a stage on the beach

Welcome

Luckily it seems that everyone brought laptops, so the PCs aren’t really needed. And the virtual machines seem to be working well so far. We haven’t yet had to compile any software on the virtual machines, and I hope it won’t be too slow when we do. We’re using 34 out of the 35 virtual machines that we created.

Tomorrow is my first session, a 1 hour practical on virtualisation, installing VirtualBox and FreeBSD, after Joel’s theory session.

AfNOG 2011, Part 1

Saturday, May 28th, 2011
Alan Barrett laying cable

I’m in Dar es Salaam, Tanzania for AfNOG 2011. I arrived on Wednesday morning at 7am (on the red-eye flight from London Heathrow) and I’m here until Tuesday 7th June.

Until now we’ve been setting up the venue. We’ve been super busy, working until midnight every night so far. We had to run our own cables, quite a lot of them (over 600 metres).

Running them through the windows was tricky, since we needed to be able to close them for security, and to allow the air conditioning to work. Someone (Alan?) came up with the genius idea of using tough palm leaves wrapped around them to protect them as they pass through the narrow gap between window panes. Bio-degradable trunking!

To cope with the power failures, Geert Jan built a monster Power-over-Ethernet injector to power the wireless access points in each room and keep the wireless network running.

The training workshops start tomorrow, Sunday 29th May, with the Unix Boot Camp, an introduction to Unix and the command line. We expect that many of the participants will have little experience with Unix, as has been the case in previous years. These free tools have immense benefits, both for us running the workshops and for the participants when they return home. But they are very different to the Windows environments that the participants are most familiar with. Without basic skills, they would struggle and hold back the group during the rest of the workshops.

Feeding the cable monster

I’m not involved in the boot camp, but after it finishes, we move straight into the main tracks, which last for five days. This year we have some new tracks: Network Monitoring & Management, Advanced Routing Techniques and Computer Emergency Response Team training.

Geert Jan with his 8-way Power over Ethernet injector

Geert Jan and the Monster Injector

We have also cancelled the basic Unix System Administration track (SA-E) this year, as it has finally been localised to most African countries, and therefore people have the opportunity to attend it locally at lower cost and build local communities. However, this leaves us with nowhere to cover more advanced systems administration techniques, which are some of my favourite topics, including:

  • virtualisation (desktops, servers and thin clients, VirtualBox, Xen, KVM, jails, lxc)
  • system imaging (ghost, snapshots)
  • backups (snapshots, Rsync, Rdiff-backup, Duplicity)
  • file servers (NFS, Samba, sshfs, AFS, ZFS)
  • virtualised storage (iSCSI, ATAoE, Fibre Channel, DRBD)
  • cloud computing (Amazon and Linode virtual servers, scripting and APIs)
  • cluster computing (Mosix, virtual machine host clusters)
  • DHCP (for network management and booting)
  • network security (firewalls, port locking, 802.1x)
  • wireless networks (planning, monitoring, troubleshooting, WEP and WPA, 802.1x authentication)
  • Windows domains and security (including Samba 4)

If participants show enough interest in these topics, they could be added in future. I think it’s unfortunate that the course is arranged into week-long tracks rather than half-day or one-day sessions from which people could pick and choose, Bar Camp style. That would make it much easier for people to run sessions on many new topics.

Stacked up computers

Some of our 80 desktop computers

In the past this would have been difficult, because we provided desktop computers for participants. It used to take us 3-4 days to set up 80-odd desktop PCs with customised FreeBSD installations. We’ve noticed that more and more people are coming to the workshops with their laptops, and this time we’ve made a big effort to shift from dedicated to virtual platforms, to reduce setup time and costs in future.

The hardest track to do this for, in my opinion, was Scalable Services English (SS-E), the one I’m working on. We were all pretty much agreed to stay with desktop PCs this year, making us the only track to do so. But when we arrived, we discovered that the mains power situation here is pretty awful. On Wednesday we had four power failures. We only have five UPS, not nearly enough to protect every desktop.

For participants with laptops, they effectively have their own built-in UPS. If we give them virtual machines to work with, then we only have to protect the hosts. We can keep those in the NOC (Network Operations Centre), where the UPS are, so they’ll be protected for around 15 minutes of any power outage, which we have to hope will be enough for the hotel to start their generator.

Cannibalising RAM

Some participants will probably forget their laptops, so we’ll provide them with desktops, but we have no way to UPS them. These desktops will be set up with FreeBSD, as in previous years.

We rented 80 machines from a local company. Some had Windows, in varying states of repair; some had no operating system installed. We decided to use some of these desktops as hosts for the participants’ virtual machines. They only had 2 GB of RAM each, but since we had plenty of spare machines, we cannibalised eight others for their RAM to upgrade our hosts to 4 GB each.

We decided to use VirtualBox for the virtual machines. It’s free, open source, can host on all major platforms (Windows, Mac, Linux and even FreeBSD), has a nice GUI and a command-line automation tool, supports bridged networking easily, and is relatively fast and efficient.

Backs of systems being imaged

Imaging backend

We configured the master (that we’ll copy onto the other machines) starting with the setup from last year. We then had to install VirtualBox and build our first virtual machine inside it. Then we discovered that the virtual machine was unable to access the network in bridged mode. Packets sent by the virtual machine were simply never sent by the host. We needed to use bridged mode so that participants could run services on their machines simply by installing them, without requiring extra configuration on the host.

Michuki Mwangi configuring a PC for imaging

Imaging frontend

We had no Internet access for most of that day, because all three of our redundant providers were down for different reasons. Eventually we managed to use Geert Jan’s 3G dongle to get online and research the problem. We found that bridged networking doesn’t work in the binary package of VirtualBox 3.2.12 that comes with FreeBSD 8.2, so we had to wait until Internet access was fixed to download 120 MB of software (ports updates and VirtualBox 4.0.8) like this:

pkg_add -r portupgrade
portsnap fetch extract update
portupgrade virtualbox-ose virtualbox-ose-kmod

This took a long time, as VirtualBox is a large piece of software which also required us to download and build a new version of Qt, but eventually it succeeded and the problem was solved.
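
Before imaging a host it may be worth confirming that the upgrade really took effect. The following is only a sketch: `version_at_least` is a hypothetical helper written for this example, and it assumes that `VBoxManage --version` prints a dotted version number followed by a build suffix.

```shell
#!/bin/sh
# Sketch: succeed if dotted version HAVE is at least WANT (three numeric fields).
# version_at_least is a hypothetical helper, not part of VirtualBox or FreeBSD.
version_at_least() {
	have=$1
	want=$2
	# Sort the two versions numerically field by field and keep the lower one.
	lowest=$(printf '%s\n%s\n' "$have" "$want" |
		sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
	[ "$lowest" = "$want" ]
}

# Possible use on a host (assumes VBoxManage is on the PATH):
# installed=$(VBoxManage --version | sed 's/[^0-9.].*//')
# version_at_least "$installed" 4.0.8 || echo "still older than 4.0.8"
```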

We decided to put only five virtual machines on each host. Sometimes we would have the whole class compiling software from ports, which would slow down all of them significantly. We will use six or seven servers to host 30-35 virtual machines. On the master host, we created five copies of our master virtual machine by copying its hard disk like this:

cd .VirtualBox/HardDisks
for i in 1 2 3 4 5; do
	cp AfNOG\ SSE\ Master.vdi vm0$i.vdi
	VBoxManage internalcommands sethduuid vm0$i.vdi
done

Moving the systems to the NOC

Relocation

Then we created the virtual machines in the VirtualBox GUI and attached them to these new images. We needed to generate a new UUID for each disk image copy, using the undocumented sethduuid command above, otherwise VirtualBox would refuse to add the copies because it had a disk image already registered with the same UUID.

We could have created the virtual machines using the VBoxManage command as well, but it would have taken longer to work out how to use it than simply to create the five machines by hand. I later worked out the commands that we could have used:

cd ~/"VirtualBox VMs"
for i in {1..5}; do
	echo $i
	vmname=VM0$i
	diskimage="$vmname/FreeBSD.vdi"
	VBoxManage createvm --name "$vmname" --ostype FreeBSD --register
	VBoxManage modifyvm "$vmname" --memory 256 \
		--nic1 bridged --bridgeadapter1 bge0.219 \
		--nic2 bridged --bridgeadapter2 bge0.$((50 + i)) \
		--vram 4 --pae off --audio none --usb on \
		--uart1 0x3f8 4 --uartmode1 server /home/chris/"$vmname"-console.pipe
	VBoxManage storagectl "$vmname" --name "IDE Controller" --add ide
	cp VM01/FreeBSD.vdi "$diskimage"
	VBoxManage internalcommands sethduuid "$diskimage"
	VBoxManage storageattach "$vmname" --storagectl "IDE Controller" \
		--port 0 --device 0 --type hdd --medium "$diskimage"
done

We named the images VM01 to VM05, which was important for running automated scripts on them. Then we configured VirtualBox to start them automatically at boot time, in headless mode, by adding the following lines to /etc/rc.conf:

vboxheadless_enable="YES"
vboxheadless_machines="VM01 VM02 VM03 VM04 VM05"
vboxheadless_user="inst"

We wrote a short script to help us apply the same command to all five virtual machines:

#!/bin/sh
# script to control all five virtual machines

command=$1
shift

for i in 1 2 3 4 5; do
	VBoxManage $command VM0$i "$@"
done

This allows us to log into a machine and do things like:

  • ./manage acpipowerbutton to initiate a controlled shutdown of all five virtual machines
  • ./manage modifyvm --macaddress1 auto to generate new, random MAC addresses after cloning the host
  • ./manage startvm --type headless to get the virtual machines running again (headlessly, independent of the GUI)

Room with desks around the edge, covered in computers and equipment

The NOC

We knew that we wouldn’t have space to attach monitors and keyboards to all the hosts, and we’d have to fiddle about with cables in the hot NOC room (without working aircon) if we needed access to their consoles, so we added the ability to log into them remotely using VNC and GDM. To do this, we had to install the VNC server:

pkg_add -r vnc

Which unfortunately doesn’t come with the nice xorg loadable module that adds a built-in VNC server to the X server, making a fast and stateless remote control session possible. So we had to hack about with inetd, first by adding a service name with a port number to /etc/services:

vnc		5900/tcp

And then a service line in /etc/inetd.conf:

vnc	stream	tcp	nowait		root	/usr/local/bin/Xvnc Xvnc -inetd :1 -query localhost -geometry 1024x768 -depth 24 -once -fp /usr/local/lib/X11/fonts/misc/ -securitytypes=none

This requires us to enable the XDMCP protocol in GDM, in order for VNC to communicate with it to present a GDM login screen. So we replaced the contents of /usr/local/etc/gdm/custom.conf with the following:

[xdmcp]
Enable=true

[security]
DisallowTCP=false

And then restarted GDM:

sudo /usr/local/etc/rc.d/gdm restart

And checked that we could connect from another machine and got a login prompt:

vncviewer 196.200.217.128

Which did indeed give us a working login screen:

VNC graphical login on a FreeBSD virtual machine host

This method is very slow. I wanted to find a better way to access the guests, especially if their network configuration was broken. I tried to connect a host serial port to a pipe and then access that pipe from a shell command, eventually over ssh, in a similar way to the text-only console offered by Xen (xm console). The above VBoxManage commands set up a pipe in my home directory, and I wrote the following short script to access it:

#!/bin/sh
set -x
echo "Console for $USER"
nc -U /home/chris/$USER-console.pipe

I created user accounts for each virtual machine, with the same name, and set their shells to this script, so that when they log in, they would automatically be connected to the pipe. However I was unable to make it work well. In particular, because of incompatible terminal emulations, I was unable to run vi to edit configuration files in the guest. If you find a way around this, please let me know. I haven’t tried it yet, but conman looks like it might be a good bet.

I spent a long time searching for the hidden VNC support in VirtualBox 4. It’s undocumented (the manual only talks about RDP) and people on the IRC channel say that it doesn’t exist, but it does, at least when starting the guests in headless mode. I added the following lines to /etc/rc.conf:

vboxheadless_VM01_flags="-n -m 5901"
vboxheadless_VM02_flags="-n -m 5902"
vboxheadless_VM03_flags="-n -m 5903"
vboxheadless_VM04_flags="-n -m 5904"
vboxheadless_VM05_flags="-n -m 5905"

And then, after starting the guests in headless mode, I could connect to these ports and access the virtual displays, much more conveniently and much faster than by shutting down the guests using VBoxManage and starting them again using the VirtualBox GUI.
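
With more guests per host, writing these lines by hand gets tedious. As a rough sketch, a loop can generate them; `vbox_rc_lines` is a hypothetical helper name, and the variable names and the `-n -m PORT` flags are simply copied from the hand-written lines above rather than taken from any documentation.

```shell
#!/bin/sh
# Sketch: print rc.conf lines for N headless guests named VM01, VM02, ...
# vbox_rc_lines is a hypothetical helper; the variable names and the
# "-n -m PORT" flags are copied from the hand-written lines above.
vbox_rc_lines() {
	count=$1
	machines=""
	i=1
	while [ "$i" -le "$count" ]; do
		name=$(printf 'VM%02d' "$i")
		machines="${machines:+$machines }$name"
		# One VNC port per guest, starting at 5901.
		printf 'vboxheadless_%s_flags="-n -m %d"\n' "$name" $((5900 + i))
		i=$((i + 1))
	done
	printf 'vboxheadless_machines="%s"\n' "$machines"
}
```

Running `vbox_rc_lines 5` prints the five flag lines followed by the matching `vboxheadless_machines` line, which can be reviewed before being appended to /etc/rc.conf.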

We used multicast to image the six virtual machine hosts from the master. This took about three hours, so we left it running overnight.

In the morning we checked that the hosts had been imaged successfully by booting them with their newly installed images, and gave them unique hostnames (host1.sse.ws.afnog.org etc.) and IP addresses.

We used the manage script to reset the MAC addresses of the network cards of each virtual machine on each host:

for i in 128 129 130 131 132 133 134; do ssh 196.200.217.$i ./manage acpipowerbutton; done
for i in 128 129 130 131 132 133 134; do ssh 196.200.217.$i ./manage modifyvm --macaddress1 auto; done
for i in 128 129 130 131 132 133 134; do ssh 196.200.217.$i ./manage startvm --type headless; done
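
One caveat: an ACPI shutdown takes a little while, and VBoxManage generally refuses to modify a guest that is still running, so it may help to wait between the first and second loops. A possible sketch, assuming the `VMState="..."` line printed by `VBoxManage showvminfo --machinereadable`; the helper names `vm_state` and `wait_for_poweroff` and the `POLL_INTERVAL` variable are hypothetical:

```shell
#!/bin/sh
# Sketch: block until a guest reports the "poweroff" state, or give up.
# vm_state and wait_for_poweroff are hypothetical helper names.
vm_state() {
	# showvminfo --machinereadable prints lines such as VMState="poweroff"
	VBoxManage showvminfo "$1" --machinereadable |
		sed -n 's/^VMState="\(.*\)"$/\1/p'
}

wait_for_poweroff() {
	vm=$1
	tries=${2:-60}   # number of polls before giving up
	while [ "$tries" -gt 0 ]; do
		[ "$(vm_state "$vm")" = "poweroff" ] && return 0
		tries=$((tries - 1))
		sleep "${POLL_INTERVAL:-5}"
	done
	return 1         # guest never reached poweroff
}
```

On each host, something like `for i in 1 2 3 4 5; do wait_for_poweroff VM0$i; done` could then run before the `modifyvm` step.
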

Michuki Mwangi setting up a projector

Astral projection

Since they were all configured for DHCP, we could have got their IP addresses from the DHCP server, but we wanted to give them a nice naming scheme, so we logged in to each one (using the console and the VirtualBox GUI) and assigned it a unique hostname and a static IP address.

We checked that we could log into each virtual machine remotely using the SSH keys that we’d installed, and then we shut down the hosts and moved them to the NOC.

Boot camp starts tomorrow, next door, but we still have to arrange our room.

Michuki Mwangi surrounded by rows of desks covered with computers

Classroom

We may have up to 37 people, our biggest class ever, in a room that’s about eight metres on a side, so layout of the room is a real challenge.

I wanted to do something to facilitate working in groups, such as each table having four places (two each side) and with its long axis front-to-back. This was vetoed because participants would have to turn their heads to see the projected screen, and it might be hard for them to take notes as a result.

So we’re going to have long, cramped benches instead. I think this is unfortunate, and I hope I can persuade people to try something more imaginative in future.

Move over Microsoft: Design’s going open source

Thursday, May 19th, 2011

I’ve been designing websites since 1999 but switching from Microsoft Windows to Ubuntu has been one of those pivotal experiences worth sharing. Joining Aptivate as the in-house designer recently has given me the opportunity to challenge some pretty old work-flows and move towards a totally open source design practice.

Aptivate work almost exclusively with open source software so it seemed a great idea to give Microsoft the push, and frankly I’d had enough of waiting 9 minutes for my laptop to reboot. That’s enough time to make 6 people a cup of tea, water the plants and rearrange the desk; all good things ONCE a day, not 5 times when the PC crashes. 3 years of filling up with junk makes for a very, very sluggish Windows PC and an unhappy designer with a tidy workspace.

Changing over – a gradual process

Leaving Microsoft was never going to be a straight switch. Leaving a web platform if you are dependent on it for your income is a scary thing.

Since I cut my design teeth on Apple Macs and Adobe software in 1995, I have moved gradually to Windows in a bid to be one step closer to end users, who on the whole use this platform. I was still, however, heavily dependent on Adobe products, primarily Fireworks and Dreamweaver for interface design and development, but also Illustrator for vector graphics and Photoshop for photo editing. I know they are good, but really? Couldn’t I achieve a good web result without them?

Open source alternatives within Windows.

Going back a bit, to before the anger towards the laptop really kicked in, finding open source alternatives was an exciting challenge. These were readily available for Windows so I didn’t have to switch, just try them out. I found a really useful website, www.osalt.com, which helps you to find open source alternatives to commercial software.

Web development – Aptana 2

Essentially I wanted an HTML/CSS editor with code completion, FTP client and project manager. I tried Amaya, Bluefish, KompoZer and Mozilla SeaMonkey, which all had great features, but none of them did ALL the things I wanted together, and I really was a bit spoilt by having them all rolled into one with Dreamweaver. Finally I found Aptana 2 which, whilst a bit fiddly to get started with, seemed to have everything I needed, hurray!

  • Best thing about Aptana – has to be the available plugin support for different types of project such as php, python/django, javascript, svn and git. It’s very comprehensive.
  • What I miss the most from Dreamweaver – nothing… had I still been dependent on the WYSIWYG editor in Dreamweaver then I could argue that this would be a big deal, but since most of my work is now with dynamic database-driven sites, I tend to use Firebug to make visual tweaks before committing them to code.

Web design – Inkscape

A key part of my work involves developing brand assets, icons and other vector graphics for both web and print design. For this I had depended on Illustrator. Quite quickly, however, I discovered Inkscape. What an amazing product! It’s a bit clunky but it lets me do 90% of the key design tasks I did in Illustrator and 60% of the tasks I used to do in Fireworks. Essentially it is the easiest transition I have made on this journey and means that I now only use Inkscape for designing for the web.

  • Best thing about Inkscape – The ability to export 32 bit png files from any selection, the page or object. The other best bit is all the native SVG features I have yet to discover!
  • What I miss the most from Illustrator – nothing worth mentioning.
  • What I miss the most from Fireworks – Image optimisation – no image editing software that I’m aware of does a better job of compressing and optimising all manner of image files than Fireworks. It creates an 8-bit alpha PNG with a tiny size and smooth edging where transparent bits kick in. This for me is the single most important missing feature of any open source alternative. Inkscape needs a good image optimising tool.

Photo editing – Gimp

Scaling, cropping, optimising photos; that’s a big part of creating photographic content for a website. Mainly I leave that to the end user but sometimes I create photo-based graphics too.

Gimp seems the most logical choice, as it suggests so itself, but I’m not finding it intuitive and it seems to crash more often than not. Fireworks has a limited range of bitmap editing tools compared to Photoshop but integrates really well within the context of creating an interface design, as bitmap and vector graphic objects live happily on the same layer or multiple layers. Now there are several open source alternatives for this work, but I’ll admit that I’m not finding a great solution, and again it’s the optimisation issue that makes me frustrated.

Designing for Low Bandwidth

If we need to create websites with small file sizes for countries with low bandwidth then we need powerful optimisers. I found that Fireworks was great because of the high compression rates achievable, which result in files less than half the size of those produced by other programmes, open source or commercial. So here is a proposal for a future project for Aptivate – create an 8-bit alpha transparency image optimiser that challenges Fireworks.

Switching to Ubuntu

A couple of months later I reformatted and partitioned the hard drive of my Dell Precision 9300 workstation. I still wanted access to Windows, so I set up a dual boot with an extra partition for shared files. Apart from my video card packing in unexpectedly (nothing to do with the installation, I’m assured), the installation went without a hitch. I was amazed by how intuitive the Ubuntu interface was. Using the software install centre, I was able to quickly and directly install all the applications I needed. Start-up and shutdown took mere seconds, and I was in production mode within a couple of hours. I was more than happy to find Dropbox, Skype and Acrobat Reader were also available. All in all it was pain free, and… Ubuntu is actually graphically beautiful (not something I had anticipated at all).

The dreaded Terminal

Well, it’s inevitable, even for a designer. The WYSIWYG addict’s worst fear, THE TERMINAL!!! agghh!! A baptism of code fire at Aptivate, no namby pamby intro here. I had pulled down the designer’s defence gate and in trickled (and sometimes poured) an endless stream of code, configuration files, settings, database fixtures, tables, smart tricks and speedy scripts, screen sharing and networking wizardry. Ah! Then I had a cup of tea. I’ll be a terminal ninja one day.

Do I miss Windows?

I don’t miss Windows. I booted into it shortly after installing Ubuntu, and found it an empty experience, like going back to a house you’ve just moved out of but still have keys to. Can’t find the kettle to make a cup of tea so not staying. I recently installed VirtualBox and now run several different versions of Windows XP to help me cross-browser debug CSS.

Moving forward

Apart from using my new open source toolbox to help Aptivate refresh its current website, there are things I want to do. It would be great to contribute to the open source community and generally help to improve already great products such as Inkscape. I’ve already started submitting Inkscape icons to the Open Clip Art Library and it would be great if we developed the image optimiser I mentioned before. There is also a lot of potential in interactive SVGs, which would be great to explore, especially since Internet Explorer 9 supports them… oh Microsoft, you are never forgotten…

If you are a designer reading this, and fancy getting involved with Aptivate’s open source efforts somehow, get in touch. If you are a developer and have a secret optimiser up your sleeve, please let us feed you and keep you amused because we want it!!!

Apptivate for Development

Friday, May 13th, 2011

As I’m at the Open Data for Development Conference, I thought I should write a quick note about some of our recent openness achievements:

  • SARPAM is a current Re-Action project to share drug price data across Southern Africa, helping countries negotiate a better deal from drug suppliers.
  • IHP+Results is another Re-Action project, just completed, to promote accountability among IHP signatories by sharing information on their health service improvements and progress towards meeting their Millennium Development Goals.

You can find the source code for both these applications on Aptivate’s GitHub. Our intention is to agree with our partners to publish full source code wherever possible.

Offline Websites and Low Bandwidth Simulator in Go

Wednesday, February 16th, 2011

Jon Thompson writes about Jeff Allen’s interesting new work on tools for working with low bandwidth:

Jeff continues to try and solve the low bandwidth/high latency problems that aid workers face in the field every day and that we encountered in Indonesia. We all know the joy of VSAT networks that slow to a crawl because either some folks on the team are downloading stuff they shouldn’t be downloading or all the computers are infected with bandwidth sucking viruses. It appears Jeff has moved one step closer to sorting out some of the problems surrounding bandwidth optimization by utilizing the Go programming language.

Rather than try and explain to you what Jeff has done I’ll let you read ‘A rate-limiting HTTP proxy in Go‘ and ‘How to control your HTTP transactions in Go‘ and sort out what he is talking about. Hopefully, this post will bait Jeff into leaving a lengthy comment that explains exactly what the hell he is up to.

My understanding is that Jeff is developing two useful tools:

People have been trying to make offlineable websites for a long time. Some of the best examples so far are using entirely client-side (in-browser) technology, such as the Logistics Operational Guide, developed by the World Food Programme for the Logistics Cluster, which can run entirely offline using Google Gears.

Gears had a lot of potential for developers to create offlineable websites, but Google has abandoned its future development in favour of the open standard HTML5, which is not ready yet. So there’s no obvious and future-proof way to develop offlineable websites at the moment. Jeff’s proxy, combined with a spidering system, could be one way to download an entire site, even if it wasn’t designed to be downloaded by the developers.
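
To give a feel for the spidering half of that idea, here is a rough sketch using GNU wget as a stand-in general-purpose spider. It is not Jeff’s proxy and makes no use of it; `mirror_site` is a hypothetical helper, and the default rate cap is an arbitrary assumption chosen so that the download doesn’t swamp a slow link.

```shell
#!/bin/sh
# Sketch: mirror a site for offline reading with GNU wget, politely.
# mirror_site is a hypothetical helper; set WGET=echo to preview the command.
mirror_site() {
	url=$1
	rate=${2:-20k}   # cap download speed so a slow link stays usable
	# --mirror: recurse with timestamping; --convert-links: rewrite links
	# for local browsing; --page-requisites: fetch images and stylesheets;
	# --no-parent: stay below the starting URL; --wait: pause between requests.
	"${WGET:-wget}" --mirror --convert-links --page-requisites \
		--adjust-extension --no-parent --wait=1 \
		--limit-rate="$rate" "$url"
}
```

A call such as `mirror_site http://example.org/ 10k` would fetch the pages, images and stylesheets under that URL and rewrite the links so the copy can be browsed from disk.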

Another important potential comes from content management systems (CMS) such as WordPress, Drupal and Joomla. More and more websites are developed using such systems, rather than coded from scratch. The systems know all of the pages on the site, and the links between them, and could easily build an offlineable version of the site for download into Gears, HTML5 or Jeff’s proxy. And one plugin could potentially enable thousands of sites to be offlineable, especially if it was included in the CMS distribution and enabled by default.

A few wikis such as MediaWiki, MoinMoin, DokuWiki and JSPWiki have a programming interface (XML-RPC or WebDAV) that allows a smart client to download pages in their original text format, which could make them more efficient to store offline and also potentially editable offline. Jeff’s proxy could be extended to support sites built in such wikis automatically. There are still some limitations to this approach:

  • The pages would not look the same as the online versions, since the styling wouldn’t be downloaded and the effects of CMS plugins would not be visible;
  • It would probably still be quite slow to download an entire site this way, by spidering, without server-side support for downloading multiple pages at once;
  • Few websites are built out of wikis, so the potential maximum reach is limited compared to better support for WordPress, Drupal or Joomla.

Anyway, I wish I knew Go, and had time to hack on Jeff’s proxy tools.