Author Archive for chrisw

Content indexing in Django using Apache Tika

Wednesday, February 1st, 2012

For the Documents module of our new open-source Generic Intranet, we need to be able to extract the text content and metadata from various kinds of documents:

  • PDF files
  • Microsoft Office DOC, XLS and PPT files
  • and the new XML equivalents, DOCX, XLSX and PPTX.

I found various tools online to help extract this text, largely thanks to Stack Overflow here and here. This ended up with a hodgepodge of tools:

There were a number of problems with this hodgepodge:

  • I was unable to find any Python or command-line solution for old Excel (XLS) files;
  • These solutions did not extract metadata, only document text;
  • The choice of which tool to use depends on the MIME type returned by the file(1) command, which varies depending on the OS (Debian/Ubuntu or CentOS) and which version of the library is installed.
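To illustrate the last problem, here is a minimal Python sketch of that kind of MIME-based dispatch. The extractor names and the entries in the table are hypothetical, chosen only for illustration; they are not a survey of what file(1) actually reports on any particular system. The point is that every variant has to be listed by hand.

```python
import subprocess

# Illustrative mapping from MIME type to a (hypothetical) extractor name.
# Several MIME strings map to the same extractor, because different
# systems may describe the same kind of document differently.
EXTRACTORS = {
    "application/pdf": "pdf_extractor",
    "application/msword": "doc_extractor",
    "application/vnd.ms-office": "doc_extractor",
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": "docx_extractor",
}


def detect_mime_type(path):
    """Ask file(1) for the MIME type of the file at path."""
    output = subprocess.check_output(
        ["file", "--brief", "--mime-type", path])
    return output.decode("utf-8").strip()


def choose_extractor(mime_type):
    """Return the extractor name for a MIME type, or None if unsupported."""
    return EXTRACTORS.get(mime_type.strip().lower())
```

Any MIME string that is missing from the table silently falls through to None, which is exactly the fragility described above.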

Another Stack Overflow post recommended Apache Tika for metadata extraction. It appears to support all the document formats that we need, and to have auto-detection of the document format, which solves all the MIME type problems as well. However, it introduces a new problem: it’s written in Java, which is hard to access from Python.

Luckily I found some instructions for building a Python wrapper around Tika, using some tools that I’d never heard of, and this seemed like a good approach. Unfortunately the installation process is very non-standard, which would not fit in with our fabric-based automated deployment process, and would make it harder for users to install the Intranet themselves.

The instructions are somewhat outdated at the time of writing, as they refer to Tika version 0.7, while 1.0 has been released. I was unable to register for an account to update that page, so I wrote to the author with the details that I discovered, and will also document here that the following command works for me:

python ../jcc/jcc/__main__.py \
        --include /usr/share/java/org.eclipse.osgi.jar \
        --jar tika-parsers-1.0.jar \
        --jar tika-core-1.0.jar \
        java.io.File java.io.FileInputStream \
        java.io.StringBufferInputStream \
        --package org.xml.sax \
        --include tika-app-1.0.jar \
        --python tika --version 1.0 --reserved asm

I was able to go further than this, and package Tika in a way that makes it easy to install with Pip, and thus integrate with our deployment process.

The wrapper is written using JCC, which works by generating and compiling C++ code that links to the Java classes, and then a Python wrapper around that C++. This means that it needs to be recompiled for each platform, so I couldn’t just distribute a binary blob with the Intranet (I had the same problem with DocToText above).

The version of setuptools on our servers doesn’t support JCC’s shared library mode. JCC dies with an error unless shared mode is explicitly disabled or the setuptools patch has been applied. I couldn’t do either of these as part of our standard deployment process, so I patched JCC to disable shared mode, since we don’t need it anyway. I also added some patches to allow various setup.py commands used by pip to be forwarded through JCC to the setup function call.

This seems to be enough to allow you to install JCC like this:

pip install git+git://github.com/aptivate/jcc.git

I also wrote a setup.py file that handles pip’s command line invocations and passes the necessary options to JCC, and JCC’s invocation of the setup function. This seems to be enough to install the package using pip:

pip install git+git://github.com/aptivate/python-tika.git

and you can use the last parameter as a package specification in pip_packages.txt, or whatever you pass to pip -r.

You can find the pip-installable Tika package, complete with Tika 1.0 JAR files, in our python-tika repository on GitHub. This will save you the work of downloading and compiling Tika and all of its dependencies. I have started a discussion with the JCC developers about merging these changes into the upstream project.

Rough Guide to rural data collection with ODK

Monday, December 5th, 2011

This post has three purposes, which I think overlap sufficiently to combine them:

  • A User Guide for the system that we developed for UNICEF, IDS and RuralNet Zambia
  • A Developers’ Guide for anyone wishing to build something similar
  • Notes on lessons learned that may assist future implementers

Project goals

Automate the data entry part of a long paper-based survey, by replacing the paper forms with electronic devices.

Hardware and application selection

The survey has several long and complex questions, and long sets of multiple-choice answers. The data collection needs to be done in dusty rural Zambia, and the devices might need to be used for a full day without power. Collected data should be sent wirelessly to a secure data repository at some time after collection.

Text entry is required for many fields. That means either a real keyboard with keys, or a sufficiently large touch screen to type comfortably on. Use of the device camera, and presentation of reports and graphs on the same device, might be required in future.

Two possible hardware platforms were identified:

  • Tablet laptops with touch screens
  • Tablet mobile devices (iPad or Android tablet)

We selected the latter for this project due to lower cost, lighter weight, better usability and longer battery life.

The available software options that we identified were:

  • EpiSurveyor (Java J2ME, partly closed source, we have used before and fixed bugs)
  • OpenXdata (Java J2ME, open source, developed and supported by an Aptivate alumnus among others)
  • Open Data Kit (ODK) (Android, open source, active community)
  • Bespoke online/offline survey in HTML5

Of these, we eliminated EpiSurveyor and OpenXdata due to lack of compatibility with the hardware platform(s) we had chosen.

We chose ODK over a bespoke system due to limited time available for development, and ability to easily take photos and record GPS coordinates using the device’s hardware.

Of the available Android tablet devices, we chose the Samsung Galaxy Tab for the pilot project, due to its high quality construction. For future projects we would probably use a lower cost device; see the lessons learned for details.

Form creation

Since the survey is quite long (about 230 questions) we wanted an easy way to enter the questions. The ODK application requires the form to be in XForms format. We identified the following tools for creating XForms:

We decided to use XLS2XForm, which enabled us to enter the large number of questions easily in Excel. The others all have graphical builders, which have advantages and disadvantages for less technical users:

  • More visually appealing
  • All available options presented visually (types of controls, groups, etc.)
  • Less likely to make a mistake and produce an invalid form
  • Cumbersome user interface slows down data entry

Unfortunately, none of these designers were able to import an existing form in XForms format, which means that the modifiable “source code” of the form must be maintained in a “proprietary” format in each case, and it’s difficult to switch between tools.

You can download the conversion tools, and the Excel spreadsheet with the completed questionnaire as we delivered it to RuralNet, here. RuralNet staff, please use the latest version of the spreadsheet that you can find locally. To use the tools, you will need to download and install Python 2.7 and Java (JRE). Then download the tools as a ZIP file and extract it somewhere. I recommend that you keep the master copy of the spreadsheet in Dropbox to ensure that it’s backed up, and it’s always clear what the latest version is.

For help in building surveys using XLS2XForm, please see the documentation. In addition to the question types listed there, we have used the following shortcuts, which also work in this customised version of XLS2XForm:

  • text is short for add text prompt (a text field, such as a person’s name)
  • note is short for add note prompt (a read-only field, providing additional information for the user)
  • time is a time field without a date (for example, survey start and end times)

To compile the spreadsheet into an XForms form, run the build_and_validate.py script by double-clicking on it. If it works, it will show the message “Success!” and create (or replace) the file called zambia-ranq-round3.xml in the same directory. Otherwise it will show an error message, which is usually caused by a mistake in the Excel spreadsheet. If your spreadsheet has a different name, you can create a shortcut to call build_and_validate_custom.py with the name of the spreadsheet on the command line.

Software components

ODK Aggregate is the software that powers the Internet server. It is a repository for blank forms (designs) and completed forms (data). Our server is located at http://partimob.appspot.com/. This server is currently paid for by us, and will need to be transferred to RuralNet at some point.

ODK Collect is the application that runs on the device, and users interact with it to complete the survey. It’s essentially a user interface for XForms. It can download blank forms (designs) from an ODK Aggregate server, and upload completed forms (data) to the Aggregate server as well.

ODK Briefcase is the software that downloads completed forms (data) from the Aggregate server and converts them into CSV (spreadsheet) format, which can be loaded into

Customised ODK Collect

We are using a custom version of ODK Collect. You can download the source code for it here, or the compiled application here. You can also find it in the ZIP file download. If you prefer, you can use the latest official version of ODK Collect. The two are compatible, but our version adds the following useful features:

  • Use supplied login and password by default to save a round trip and a prompt.
  • Add keyboard navigation, useful for form filling on android-x86 because the mouse interface is pretty clunky.
  • Restore ability to modify completed and submitted forms on the device, which was removed from the official version in 1.1.7.
  • Improved error messages and progress indication during form uploads.
  • Allow setting the instance name on the first page of the survey.
  • Allow saving incomplete surveys on required questions (in case a survey is interrupted; almost all of our questions are required).

There are several ways to install ODK Collect on a device:

  • Download it from the Android Market (official version only, not our customised version)
  • Copy the APK file onto a microSD card, insert the card into the device, and use the My Files application to find and open it from the SD card.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the APK file onto the device’s internal memory, then use the My Files application to find and open it.
  • Attach the USB cable from the device to a computer, and use ADB’s install command to install the APK file.
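If you have several devices to set up, the ADB method is easy to script. This is only a sketch: it assumes that the adb program is on your PATH and that a single device is attached, and the function names are mine, not part of any ODK tool.

```python
import subprocess


def adb_install_command(apk_path, reinstall=False):
    """Build the argument list for installing an APK with ADB."""
    command = ["adb", "install"]
    if reinstall:
        # -r replaces an existing installation of the same application
        command.append("-r")
    command.append(apk_path)
    return command


def install_apk(apk_path, reinstall=False):
    """Run adb install; raises CalledProcessError if adb reports failure."""
    subprocess.check_call(adb_install_command(apk_path, reinstall))
```

Calling install_apk with the path to your copy of the APK file, once per attached device, should be enough; pass reinstall=True when upgrading a device that already has ODK Collect installed.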

It’s useful to put the application onto the device’s desktop. To do that, open the Applications list, find ODK Collect, and press and hold it with your finger for a few seconds. The background will change to the desktop; release your finger to drop the application there.

It’s also useful to remove all the other junk from the desktop. For each icon and widget on the desktop, press and hold it with your finger for a few seconds, until the trashcan icon appears, then drag your finger to the trashcan and release it there.

Form management on the device

There are several ways to put blank forms (designs) onto the tablets:

  • Download them from the ODK Aggregate server using ODK Collect.
  • Copy them onto a microSD card, insert the card into the device, and use the My Files application to copy them from the SD card to the /sdcard/odk/forms directory.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the form into the /sdcard/odk/forms directory.
  • Attach the USB cable from the device to a computer, and use ADB or DDMS to push the file onto the device, into the /sdcard/odk/forms directory.

Of these methods, ADB or DDMS is recommended for rapid development, and using the Aggregate server is recommended for production use, since the form must be installed on the Aggregate server for it to be able to accept submissions.

Similarly there are several ways to copy completed forms (data) off the device:

  • Upload them to the ODK Aggregate server using ODK Collect.
  • Use the My Files application to copy them from /sdcard/odk/instances to a microSD card, then remove the card and connect it to the computer, and drop the files into the ODK Briefcase data directory.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the files from the /sdcard/odk/instances directory to the ODK Briefcase data directory.
  • Attach the USB cable from the device to a computer, and use ADB or DDMS to pull the file from the device’s /sdcard/odk/instances directory to the ODK Briefcase data directory.

Of these methods, using ODK Aggregate is recommended for development and production use.

Since the Aggregate server is on the Internet, this method requires that the device have Internet access. So it either needs a valid SIM card installed with credit and a data bundle, or a WiFi network connected. We had many problems with using SIM cards for data, so WiFi is preferred if possible.

The directories mentioned above will not exist until ODK Collect is installed on the device and run for the first time. Forms downloaded from the Aggregate server will also be placed in the /sdcard/odk/forms directory. Forms completed on the device will be placed in the /sdcard/odk/instances directory.
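For rapid development, the ADB transfers in both directions can be wrapped in a few lines of Python. This is a sketch under some assumptions: adb is on your PATH, one device is attached, and the Briefcase data directory is whatever local path you pass in. The helper names are hypothetical.

```python
import os
import posixpath
import subprocess

# Directories created by ODK Collect on the device
FORMS_DIR = "/sdcard/odk/forms"
INSTANCES_DIR = "/sdcard/odk/instances"


def push_form_command(local_form):
    """Build the adb command that copies a blank form onto the device."""
    # Device paths always use forward slashes, whatever the computer's OS
    remote = posixpath.join(FORMS_DIR, os.path.basename(local_form))
    return ["adb", "push", local_form, remote]


def pull_instances_command(local_dir):
    """Build the adb command that copies all completed forms off the device."""
    return ["adb", "pull", INSTANCES_DIR, local_dir]


def run(command):
    """Run an adb command, raising CalledProcessError if it fails."""
    subprocess.check_call(command)
```

For example, run(push_form_command("zambia-ranq-round3.xml")) pushes the compiled form, and run(pull_instances_command(...)) with your Briefcase data directory fetches the completed ones.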

Configuring ODK Collect

Collect needs to know the details of the ODK Aggregate server to log into it, download blank forms and upload completed forms.

Open the ODK Collect application, press the Settings button and click on Change Settings. Click on URL and enter https://partimob.appspot.com. Similarly, complete the Username and Password using the details that you’ve been given by the Aggregate server operator, or the account that you’ve created on the Aggregate server. This account should only have Data Collector permissions, no more. Press the Back key to get back to the main menu of ODK Collect.

Downloading forms using ODK Collect

Open ODK Collect on the device, and click on the Get Blank Form button. Collect will try to log into the Aggregate server using the details that you’ve provided, and get a list of forms on the server that have the Downloadable box ticked. This is on by default for newly uploaded forms.

Tick the box next to all the forms that you want to download, and click on the Get Selected button.

Filling forms on the device

Open ODK Collect on the device, and click on the Fill Blank Form button. All the forms in the device’s /sdcard/odk/forms directory should be listed. Choose the form that you want to complete.

You will see an introductory screen showing how to move between questions by swiping your finger across the screen, from right to left or left to right. This screen has a text box at the bottom, which you can use to name the form that you’re completing. Naming forms is useful if your data collection is interrupted and you need to resume it later. It’s much easier to identify the form using its name, rather than opening it and flicking through to find some identifying information. You might name the form based on the household code that you’re surveying.

Depending on your answers to some questions, others may be hidden, or their text might change.

At the end of the form there is another chance to Name this form, and a tickbox to Mark form as finalized. Before you can upload the form to the Aggregate server, this box must be ticked, and you must press the Save Form and Exit button. Otherwise Collect will consider that the form is incomplete.

Sending completed forms to Aggregate

Open ODK Collect on the device, and click on the Send Finalized Form button on the main menu. Tick the box next to all the forms that you want to upload to Aggregate, and click on Send Selected. After the upload is complete, you should see the Upload Results message. Every form should have “Success” next to it, otherwise it was not sent successfully.

Downloading forms using Briefcase

We are using a customised version of ODK Briefcase with the following changes:

  • Fix the export of repeated groups, which before only worked for the first row (issue 461).
  • Shorten exported column names, to allow the CSV file to be imported into Access.
  • Allow the server name, username and password to be provided on the command line (or via a shortcut).

You can find the source code here and the pre-compiled version here, as an executable JAR file. You can also find it in the ZIP file download. If you make changes to the source and want to build the executable JAR again, install Maven and use the mvn package command.

To download the completed forms, open Briefcase by double-clicking on the briefcase-1.0-jar-with-dependencies.jar file. On the Transfer tab, click on the Connect button. For the URL, enter https://partimob.appspot.com, and for the user name and password, give the details of an ODK Aggregate account with Data Viewer permissions.

Then you should see a list of forms appear under the heading Forms to Transfer. Tick the box next to the one that your users have been completing, and then click on the Transfer button. If you do this after all the completed forms (data) have been submitted to the ODK Aggregate server, you will not need to do it again for that form template (design).

Now switch to the Transform tab and see if the form appears in the Form list. If it doesn’t, then exit and restart the Briefcase application (issue 464).

For Output Type, choose .csv and media files. For Output Directory, choose the directory where you’d like to save the CSV files. Note that any previous files exported to that directory from the same form will be overwritten without warning, even if they have been modified (cleaned). Click on the Output button to write the CSV files.

Cleaning data in Excel

You can find the Excel spreadsheet that we use for data storage and cleaning here. Note that Excel is a long way from the best way to store and manipulate data like this. Microsoft Access would be far more appropriate. Yet again I wish there was a sufficiently powerful open source alternative.

Because the spreadsheet contains cleaned data, which is “better” than the raw data which is included in the CSV export, we don’t want to overwrite existing rows. For the main section of the questionnaire (the so-called Single Responses) you can include only the new data like this:

  • Open the main spreadsheet and switch to the Single Responses tab
  • Highlight all rows from 3 down to the bottom, and Sort them by the SubmissionDate column.
  • Note the last submission date on this spreadsheet.
  • Open the newly exported CSV file for the single responses (something like RANQ-2011-Round-4-v5.csv).
  • Sort this file by the SubmissionDate column as well.
  • Highlight and copy all the rows whose submission date is later (more recent) than the last one in the main spreadsheet.
  • Paste them at the bottom of the Single Responses tab of the main spreadsheet, below the other data.
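The same selection can be sketched in Python (3), which may help if the manual steps become tedious. This is not part of the tools we delivered. It assumes that both the main sheet and the new export have been saved as CSV with a header row, and that SubmissionDate values look like 2011-12-05 10:30:00; check the format in your own export and adjust DATE_FORMAT to match. The HouseholdCode column in the usage below is just an example.

```python
import csv
import io
from datetime import datetime

# Assumed date format; change this to match your exported CSV files
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_date(value):
    """Convert a SubmissionDate string into a datetime for comparison."""
    return datetime.strptime(value, DATE_FORMAT)


def read_rows(csv_text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def new_rows(main_rows, exported_rows, date_column="SubmissionDate"):
    """Return exported rows submitted after the latest row in main_rows.

    The result is sorted oldest first, ready to append below the
    existing (cleaned) data without overwriting any of it.
    """
    if main_rows:
        last_seen = max(parse_date(row[date_column]) for row in main_rows)
    else:
        last_seen = datetime.min
    fresh = [row for row in exported_rows
             if parse_date(row[date_column]) > last_seen]
    return sorted(fresh, key=lambda row: parse_date(row[date_column]))
```

Rows that are already in the main sheet are never touched, so any cleaning done there is preserved.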

For the other tables, this process needs to be done completely manually at present.

You can then check and clean the data by viewing and modifying it in Excel. Note that each sheet has one or two columns at the end, which are filled by formulae that look up values from the Single Responses sheet, such as the Household Code.

Using the Android x86 Emulator

To be written.

Lessons learned

To be written.

How can a $35 tablet computer change the world?

Friday, October 21st, 2011

Osama Manzar poses some very interesting questions about India’s new $35 tablet computer “for the poor”. However he doesn’t attempt to answer these questions, leaving the reader in no doubt that he thinks the answer is No! in all cases.

I must admit to being skeptical about any such innovation, and I’ve been listening to both sides of the debate on the BytesForAll mailing list. Despite my skepticism, Osama’s questions have some answers, and I’d like to present them for comment.

  • India has one of the lowest ratio of teachers—just 456 teachers per million people.
  • Seventy-two percent of our primary schools have only three teachers or less.
  • 25% of teachers were absent from school, and only about half were teaching, during unannounced visits to a nationally representative sample of government primary schools.

How is the $35 tablet going to solve any of these problems?

Of course technology on its own is not going to solve these problems. It is just a valuable weapon in the armoury of those who would launch an all-out war on poverty (and other abstract nouns).

Kentaro Toyama, an ex-Microsoft guru turned ICT4D researcher, says that “technology is [just] an amplifier of human intent and capacity.” And when faced with a task that’s possible but simply too large, an amplifier is exactly what we need. It doesn’t need to be high tech. Tanzania did just fine with radio, one of the oldest, simplest and most inclusive ICTs:

About ten years after independence, Tanzania decided to move towards universal primary education, almost doubling the number of children in school. The government estimated that it needed an extra 40,000 teachers. As the existing training colleges were producing only 5,000 new teachers a year, it was decided to recruit secondary school leavers and train them on an apprenticeship model, partly on the job and partly through distance education. Over a period of three years, they were posted in schools where they had a reduced teaching load. They then followed correspondence courses backed by radio programmes; they were supervised and tested on their classroom practices, and passed their examinations. Two evaluations found that they ended up reasonably competent in the classroom (Chale, 1993; quoted by Perranton, 2000; retrieved from UNU)

If India were to launch a massive teacher education programme, they would find it cheaper to implement that programme using technology. For example, they might distribute radios, TVs, portable audio players or even (heaven forbid!) computers to trainee teachers. It might take longer for those teachers to reach high standards, and more might drop out, without the personal connection and feedback of face-to-face training. Even so, one could train more teachers for more time and achieve a similar number of fully trained teachers at a lower cost.

In the business sector, more than 70% micro, small and medium enterprises (MSMEs) are not connected to information society to leverage opportunities of business and efficiency. How will the $35 tablet help in the financial inclusion of MSMEs, which are largely situated in small towns and remote areas?

It’s unfortunate that the tablet doesn’t include a long-range wireless network (such as GPRS), which must surely cover most of India as it does Africa. Even without an Internet connection, it can still provide useful services such as record keeping, business accounting and stock tracking to small enterprises. The tablet is based on Android, but the marketplace has been disabled, and this is a serious limitation. I think it’s likely to be overcome soon. When that happens, India’s many skilled software developers will be free to create localised applications for a potentially huge local market.

Most of India’s 3.3 million non-governmental organisations (NGOs) are also located in remote areas—70% of them lack any sort of information and communication technology (ICT) infrastructure or connectivity, and have no websites.

How can the $35 tablet help these NGOs’ global outreach efforts or aid the millions of people working with them in rural areas?

You probably know the answer to this question as well as I do: The same way as computers and phones can, only more so. Helping people to communicate and to do their work is exactly what ICTs do. All of them. With the possible exception of Angry Birds. A computer can help us to make leaflets, track visits to patients and beneficiaries, diagnose illnesses, improve farming techniques, or learn about anything we wish to know in the whole world of knowledge.

Can it bring transparency in governance at this level?

Good question. Not by itself, sure. Transparency comes from open data. The people might get together to publish what the government would rather hide, or pressure the government to release the data, but a $35 tablet won’t help them much.

When they do release that data, however, the usual problem is how to make use of it. Government data tends to be massive and unwieldy, and answering difficult questions takes much time and significant skill even with the best of data. I think that free, open, widecast media provide the biggest opportunity to make real use of transparency, and our use of the Internet as an enabler of democracy is the best example of that. Potentially, a simple but powerful Internet device could help bring people together to investigate and answer those difficult questions. But by the sound of it, this tablet is not quite there yet. Hopefully it will be soon.

Since a large population of our country communicate verbally, and cannot read and write with ease, their preferred medium of content consumption and content production is audio-visual… But to make use of good multimedia content, you need powerful machines, not cheap and underperforming ones.

I disagree with that. I grew up with “multimedia content” on BBC Micros: simple games, moving blocks around a screen, simple word processors and spreadsheets and databases and graphics. A picture is worth a thousand words, and a simple, clear diagram can be worth far more than a complex, confusing one. Advanced graphics are no substitute for a visual designer’s ingenuity and skill. Wikipedia is “multimedia content” that is perfectly suited to a $35 tablet.

If the $35 tablet can do anything good to education in India, the only way is by handing them to each and every teacher and school management staff to monitor the workings and functioning of the school and its teachers…

Monitoring is an interesting application, and a double-edged sword. Robert Chambers, the inventor of participatory rural appraisal, told us a story at the recent ICT4D Finale event in Cambridge of a hospital in India where the nurses were given mobile phones “to collect data at the source.” But the director of the hospital used it to monitor what they were doing, effectively spying on them. The nurses went on strike and eventually the director was fired. I think that for monitoring to have a positive benefit, it must be done with consent and a shared vision to use the data to improve performance, not to criticise and control.

rather than assuming that each student will buy Aakash and India will become digitally literate overnight.

I have to agree with that sentiment, although I’m not sure who raised it. Kapil Sibal, who takes the credit for inventing the $35 tablet, merely said:

This low cost device is part of our national mission on education through information and communication technology (NME-ICT) which will connect over 1,000 institutions across the country, enabling tonnes of web-based course content for free.

Now that doesn’t sound so far-fetched, does it?

Computers in Schools: Sound solutions

Monday, September 5th, 2011

Activities with sound are ideal for kids. Preferably lots of sound. Especially when it comes to teaching language, reading and writing.

When you have a classroom full of children with computers, each working at their own pace on speech or language tasks, they need private sound rather than the built-in speakers of their laptops. Otherwise the cacophony would make learning much harder for all of them.

Headphones (or headsets) are the normal solution for language labs in UK schools. But they’re not great for use with primary school kids in a dirty, dusty environment. They’re extremely fragile, hard to clean, uncomfortable to wear for long periods, and can spread ear infections.

A Bluetooth headset would work, and would be nice in theory, but it would be much more expensive and would need charging often.

The most obvious solution seems to be something that looks like a mobile phone, but attaches to a computer with a cable. They’re very hard to find. It seems that everyone wants tiny, delicate, wireless or in-ear headsets. So manufacturers don’t bother making the kind of big, clunky, bulletproof handsets I’m thinking of.

First, after long and fruitless searching, I discovered that what I’m looking for is actually called a handset (because you hold it in your hands) and not a headset (that fits over your head).

And then I found them:

[Images: four USB handsets, including a Nokia-like USB handset and a slim grey USB handset]

Unfortunately the cheapest I’ve found so far is £10 ($14) through Maplin, which is about ten times the cost of the cheap, fragile headsets we’d like to replace.

If you know of any others, or a cheaper bulk supplier than Maplin (such as their supplier in China) please let us know!

Ubuntu Laptops in Schools

Thursday, September 1st, 2011

I’m currently working on a project that’s putting computers into Zambian schools to try to revolutionise education, making it more fun and interactive for kids, and reducing the problems of teacher absence.

They’re using Intel Classmate style PCs, currently running Windows 7 Home Starter. I’m investigating whether Ubuntu would provide a better experience. It might be faster, more reliable, more manageable and easier to lock down than Windows.

Ubuntu 10.10 (Maverick) doesn’t boot on these computers, probably due to problems with the HPET. I don’t like Unity so I don’t want to try 11.04 just yet, which left me falling back to 10.04 (Lucid) with long-term support.

Automatic Logins

The computers should be in a kiosk-like mode for student use, where no login is required but they are locked down. They should also be usable by teachers (with a password and fewer restrictions) and administrators (with another password and no restrictions). So I created three user accounts. Student is set to log in by default with no password.

While this works, there are other places where a password is requested and none works, because the Student account doesn’t have a valid password:

  • unlocking from screensaver
  • switching users
  • sudo from the command line

The last one is less important because students should not be able to access the command line anyway, or have any administrative rights. But they need to unlock the screensaver and be able to switch users.

We solved the screensaver problem by telling the screensaver not to lock the screen for this user, just as we did for Camfed in the Zambia SRC with LTSP:

# Disable locking the screen for users with no password to unlock it
sudo -u student gconftool-2 \
	--type boolean \
	--set /apps/gnome-screensaver/lock_enabled false

However the user switching was more tricky. Luckily I found a very helpful question and answer on SuperUser. I improved on it slightly by reusing Ubuntu’s builtin nopasswdlogin group, so that users who can log in with no password can also be switched to with no password.

To achieve this, just add the following line at the beginning of /etc/pam.d/gnome-screensaver:

auth sufficient pam_succeed_if.so user ingroup nopasswdlogin
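When setting up many laptops, this edit can be automated. The following Python sketch only transforms the text of the file; reading and writing /etc/pam.d/gnome-screensaver (as root) is left to the caller, and the function name is my own invention. It leaves the text unchanged if the line is already present, so it is safe to run more than once.

```python
PAM_RULE = "auth sufficient pam_succeed_if.so user ingroup nopasswdlogin"


def prepend_rule(pam_text, rule=PAM_RULE):
    """Return pam_text with rule as its first line, unless already present."""
    lines = pam_text.splitlines()
    if any(line.strip() == rule for line in lines):
        # Already configured: leave the file contents untouched
        return pam_text
    return "\n".join([rule] + lines) + "\n"
```

The rule has to come first because PAM evaluates the auth lines in order, and a sufficient rule that succeeds skips the password checks that follow it.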

Firefox Kiosk Mode

We want the browser to be fullscreen all the time, so we need to use some extensions:

  • Full Fullscreen to make it start in fullscreen mode;
  • Keyconfig to stop them exiting full screen mode with F11, or closing the browser with Alt-F4.

We also change some preferences using about:config:

  • xpinstall.enabled: false, to prevent installing more extensions;
  • app.update.auto: false, to stop Firefox checking for updates by itself;
  • browser.sessionstore.resume_from_crash: false, to prevent the Restore previous session prompt when starting Firefox;
  • extensions.update.enabled: false, to stop Firefox checking for updates to its installed extensions;
  • extensions.update.notifyUser: false, to avoid a prompt if an extension update is discovered;
  • browser.tabs.warnOnClose: false, to avoid the prompt to save your tabs on browser exit.
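These preferences can also be set without clicking through about:config, by writing them to a user.js file in the Firefox profile directory, which Firefox reads at startup. The following is only a minimal sketch, not what we actually did on these machines. The helper name write_kiosk_prefs is made up for illustration, and it overwrites any existing user.js in the directory you give it.

```shell
#!/bin/sh
# Sketch: write the kiosk preferences listed above into a profile's user.js.
# write_kiosk_prefs is a hypothetical helper name; pass it a profile directory.
# Note: this replaces any user.js that is already there.
write_kiosk_prefs() {
	profile_dir=$1
	cat > "$profile_dir/user.js" <<'EOF'
// Kiosk lockdown preferences (same values as set in about:config)
user_pref("xpinstall.enabled", false);
user_pref("app.update.auto", false);
user_pref("browser.sessionstore.resume_from_crash", false);
user_pref("extensions.update.enabled", false);
user_pref("extensions.update.notifyUser", false);
user_pref("browser.tabs.warnOnClose", false);
EOF
}
```

To use it, call write_kiosk_prefs with the student's profile directory (somewhere under ~/.mozilla/firefox/; the directory name is generated by Firefox, so you have to look it up) while Firefox is not running.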

Window Manager

We want the students to have access to a restricted set of applications. The user interface also needs to be unbreakable (child-proof). Windows should always be maximised, as the laptops have quite small screens. All of this points to using a custom window manager/desktop instead of the standard Gnome or KDE.

Fluxbox and Openbox were recommended, but they seem to be aimed at highly-customised desktop environments (for geeks) rather than locked-down kiosks or embedded systems. Matchbox looks like quite a good fit. It has a very simple front menu and an everything-maximised window manager, which sounds great for ease of use.

We’re using GDM for the user login, which offers users a choice of which session (window manager) to run. This is OK, and even quite good for administrators, as it provides a failsafe option in case the usual window manager is borked. But I can’t see how to disable or override this for particular users. Students have no-password logins, so they don’t even get the opportunity to choose a window manager.

The DefaultSession in /etc/gdm/custom.conf (chosen using gdmsetup) changes their window manager, but affects all users, and we don’t want to force everyone to use the restrictive kiosk window manager.

I found that GDM lets you specify your own Xsession script, which gdm uses to actually start the session selected by the user. So I wrote a replacement:

#!/bin/sh

if [ "$USER" = "student" ]; then
	/etc/gdm/Xsession /usr/bin/matchbox-session
else
	/etc/gdm/Xsession "$@"
fi

All it does is call the original Xsession, overriding the name of the session manager if the current user is the special student user, and otherwise passing the arguments through unchanged so that everything behaves exactly as normal.

Save it in /usr/local/bin/GdmKioskSession, make it executable, and add the following line to /etc/gdm/custom.conf:

BaseXSession=/usr/local/bin/GdmKioskSession

If you don’t even want the application menu, but want to force a particular application such as a web browser (true kiosk mode), replace /usr/bin/matchbox-session with /usr/local/bin/kiosk-session, create that file with the following contents and make it executable:

#!/bin/sh
matchbox-window-manager -use_titlebar no &
exec /usr/bin/chromium-browser -kiosk -app=http://staging.ischool.zm/

More lockdown tips to follow.

Traffic shaping with PF, ALTQ and HFSC

Friday, August 5th, 2011

We usually use Linux firewalls for traffic shaping, because the power of the traffic control (tc) system exceeds FreeBSD’s dummynet in most ways.

Dummynet can be used to create arbitrary delays and packet loss, which is very useful for simulating poor connections, but not for sharing bandwidth and prioritising packets between different traffic classes on a real traffic shaper.

However, I’ve just been testing PF (the new standard packet filter) and ALTQ (the alternative queueing system) on FreeBSD, and I’m impressed by the capabilities. I prefer this combination (PF+ALTQ) over Linux TC because:

  • PF and ALTQ are fully integrated and configured using the same file, whereas TC has its own (very hard to use) classifier. I normally use the iptables CLASSIFY target to classify traffic instead, but this is not integrated.
  • TC is very hard to use generally. The authors seem more concerned with functionality than usability.
  • ALTQ has named queues which helps usability enormously compared to TC’s hex numbered classes.
  • ALTQ gives very low delay when the interface is not 100% saturated, which seems impossible to achieve with TC.

It does annoy me that ALTQ is not enabled in the default kernel, so you have to compile your own kernel. I used the following commands to copy the default GENERIC configuration to a custom one, which I called ALTQ:

cd /boot
cp -Rp kernel GENERIC # back up the current kernel (/boot/kernel is a directory)
cd /usr/src/sys/i386/conf
cp GENERIC ~/ALTQ
ln -s ~/ALTQ .
vi ALTQ

and added the following lines to the new kernel configuration file, ALTQ:

options ALTQ
options ALTQ_RED
options ALTQ_RIO
options ALTQ_HFSC
options ALTQ_PRIQ

and then compiled and installed the new kernel:

cd /usr/src
make buildkernel KERNCONF=ALTQ
make installkernel KERNCONF=ALTQ

and then rebooted to load the new kernel. After that, we need to create a pf configuration. Some example configurations use CBQ queues, but I prefer HFSC because:

  • HFSC is guaranteed accurate, whereas CBQ is approximate
  • CBQ requires you to guess the average packet size and its accuracy depends entirely on this
  • HFSC has service curves which allow you to deliver small files quickly and drop the priority of large connections (e.g. file downloads) with great ease.

Here is a sample configuration of PF+ALTQ+HFSC that I used for testing on a transparent bridging firewall (bridge0 connecting em0 and em1):

altq on em1 hfsc bandwidth 1Mb queue { ftp, ssh, icmp, other }
queue ftp bandwidth 30% priority 0 hfsc (upperlimit 99%)
queue ssh bandwidth 30% priority 2 hfsc (upperlimit 99%)
queue icmp bandwidth 10% priority 2 hfsc (upperlimit 99%)
queue other bandwidth 30% priority 1 hfsc (default upperlimit 99%)
pass out quick on bridge0 inet proto tcp from any port 21 to any queue ftp
pass out quick on bridge0 inet proto tcp from any port 22 to any queue ssh
pass out quick on bridge0 inet proto icmp from any to any queue icmp
pass out quick on bridge0 all

We are only queueing on em1 here, which is the downstream, so we are only limiting downloads. We deliberately limit them to 1 Mbps for testing. The limit should always be lower than your actual download bandwidth, to ensure that the queue is on the FreeBSD firewall and not any other device.

We create four named queues under the root, which is implicitly named root_em1. We reserve 30% of bandwidth each for FTP, SSH and other traffic, and 10% for ICMP. However, any class can exceed its reserved bandwidth, up to its upperlimit. This defaults to 100%, which means that one class can potentially cause delays to traffic in other classes, so we override it to 99%.
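To make the percentages concrete, here is a small sketch that converts them into absolute rates for the 1 Mbps test link. It assumes that pf treats 1Mb as 1,000,000 bits per second; the names link_bps and queue_rate are just for illustration and are not pf keywords.

```shell
#!/bin/sh
# Sketch: turn the percentages in the pf.conf above into absolute rates.
# link_bps and queue_rate are illustrative names, not pf keywords.
link_bps=1000000	# "bandwidth 1Mb" on em1, assumed to be 1,000,000 bit/s

# queue_rate PERCENT: print PERCENT of the link rate in bits per second
queue_rate() {
	echo $(( link_bps * $1 / 100 ))
}

# name:reserved-percent pairs, copied from the queue definitions
for entry in ftp:30 ssh:30 icmp:10 other:30; do
	name=${entry%%:*}
	percent=${entry##*:}
	echo "$name reserved=$(queue_rate "$percent") upperlimit=$(queue_rate 99)"
done
```

So FTP, SSH and other traffic are each guaranteed 300,000 bit/s, ICMP gets 100,000 bit/s, and no single queue can take more than 990,000 bit/s even when the others are idle.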

Note that even though we create the queues on the em1 device, we must filter packets on bridge0, as otherwise our traffic does not match our pf rules.

Update: I found some more information about traffic shaping and advanced usage of HFSC, including realtime guaranteed classes for VoIP.

Update 2: For a simpler setup with ALTQ, try this guide.

AfNOG 2011, Part 2

Monday, May 30th, 2011
People sitting at computers in a lecture

Boot Camp

AfNOG boot camp was absolutely massive this year. I think they had 75 people when they were only expecting 40. They took over half our classroom as well, which made setup tricky as we had to work around people and ask them to move repeatedly, and we couldn’t get all of our tables in to cable them up.

It was followed by the obligatory welcome dinner, at the White Sands’ outdoor restaurant, with the requisite number of speeches and rounds of applause.

Today we had the first day of Scalable Services. Desktop installation hadn’t gone too well. My attempt to respin with fixes, wiping the unused space after the imaged partition, failed badly and resulted in a corrupted image, so we had to reimage those boxes.

People sitting around dinner tables in front of a stage on the beach

Welcome

Luckily it seems that everyone brought laptops, so the PCs aren’t really needed. And the virtual machines seem to be working well so far. We haven’t yet had to compile any software on the virtual machines, and I hope it won’t be too slow when we do. We’re using 34 out of the 35 virtual machines that we created.

Tomorrow is my first session, a 1 hour practical on virtualisation, installing VirtualBox and FreeBSD, after Joel’s theory session.

AfNOG 2011, Part 1

Saturday, May 28th, 2011

Alan Barrett laying cable

I’m in Dar es Salaam, Tanzania for AfNOG 2011. I arrived on Wednesday morning at 7am (on the red-eye flight from London Heathrow) and I’m here until Tuesday 7th June.

Until now we’ve been setting up the venue. We’ve been super busy, working until midnight every night so far. We had to run our own cables, quite a lot of them (over 600 metres).

Running them through the windows was tricky, since we needed to be able to close them for security, and to allow the air conditioning to work. Someone (Alan?) came up with the genius idea of using tough palm leaves wrapped around them to protect them as they pass through the narrow gap between window panes. Bio-degradable trunking!

To cope with the power failures, Geert Jan built a monster Power-over-Ethernet injector to power the wireless access points in each room and keep the wireless network running.

The training workshops start tomorrow, Sunday 29th May, with the Unix Boot Camp, an introduction to Unix and the command line. We expect that many of the participants will have little experience with Unix, as has been the case in previous years. These free tools have immense benefits, both for us running the workshops and for the participants when they return home. But they are very different to the Windows environments that the participants are most familiar with. Without basic skills, they would struggle and hold back the group during the rest of the workshops.

Feeding the cable monster

I’m not involved in the boot camp, but after it finishes, we move straight into the main tracks, which last for five days. This year we have some new tracks: Network Monitoring & Management, Advanced Routing Techniques and Computer Emergency Response Team training.

Geert Jan with his 8-way Power over Ethernet injector

Geert Jan and the Monster Injector

We have also cancelled the basic Unix System Administration track (SA-E) this year, as it has finally been localised to most African countries, and therefore people have the opportunity to attend it locally at lower cost and build local communities. However, this leaves us with nowhere to cover more advanced systems administration techniques, which are some of my favourite topics, including:

  • virtualisation (desktops, servers and thin clients, VirtualBox, Xen, KVM, jails, lxc)
  • system imaging (ghost, snapshots)
  • backups (snapshots, Rsync, Rdiff-backup, Duplicity)
  • file servers (NFS, Samba, sshfs, AFS, ZFS)
  • virtualised storage (iSCSI, ATAoE, Fibre Channel, DRBD)
  • cloud computing (Amazon and Linode virtual servers, scripting and APIs)
  • cluster computing (Mosix, virtual machine host clusters)
  • DHCP (for network management and booting)
  • network security (firewalls, port locking, 802.1x)
  • wireless networks (planning, monitoring, troubleshooting, WEP and WPA, 802.1x authentication)
  • Windows domains and security (including Samba 4)

If participants show enough interest in these topics, they could be added in future. I think it’s unfortunate that the course is arranged into week-long tracks rather than half-day or one-day sessions from which people could pick and choose, Bar Camp style. That would make it much easier for people to run sessions on many new topics.

Stacked up computers

Some of our 80 desktop computers

In the past this would have been difficult, because we provided desktop computers for participants. It used to take us 3-4 days to set up 80-odd desktop PCs with customised FreeBSD installations. We’ve noticed that more and more people are coming to the workshops with their laptops, and this time we’ve made a big effort to shift from dedicated to virtual platforms, to reduce setup time and costs in future.

The hardest track to do this for, in my opinion, was Scalable Services English (SS-E), the one I’m working on. We were all pretty much agreed to stay with desktop PCs this year, making us the only track to do so. But when we arrived, we discovered that the mains power situation here is pretty awful. On Wednesday we had four power failures. We only have five UPS, not nearly enough to protect every desktop.

Participants with laptops effectively have their own built-in UPS. If we give them virtual machines to work with, then we only have to protect the hosts. We can keep those in the NOC (Network Operations Centre), where the UPS are, so they’ll be protected for around 15 minutes of any power outage, which we have to hope will be enough for the hotel to start their generator.

Cannibalising RAM

Some participants will probably forget their laptops, so we’ll provide them with desktops, but we have no way to UPS them. These desktops will be set up with FreeBSD, as in previous years.

We rented 80 machines from a local company. Some had Windows, in varying states of repair, some had no operating system installed. We decided to use some of these desktops as hosts for the participants’ virtual machines. They only had 2 GB of RAM each, but since we had plenty of machines, we cannibalised eight others for their RAM to upgrade our hosts to 4 GB each.

We decided to use VirtualBox for the virtual machines. It’s free, open source, can host on all major platforms (Windows, Mac, Linux and even FreeBSD), has a nice GUI and a command-line automation tool, supports bridged networking easily, and is relatively fast and efficient.

Backs of systems being imaged

Imaging backend

We configured the master (that we’ll copy onto the other machines) starting with the setup from last year. We then had to install VirtualBox and build our first virtual machine inside it. Then we discovered that the virtual machine was unable to access the network in bridged mode. Packets sent by the virtual machine were simply never sent by the host. We needed to use bridged mode so that participants could run services on their machines simply by installing them, without requiring extra configuration on the host.

Michuki Mwangi configuring a PC for imaging

Imaging frontend

We had no Internet access for most of that day, because all three of our redundant providers were down for different reasons. Eventually we managed to use Geert Jan’s 3G dongle to get online and research the problem. We found that bridged networking doesn’t work in the binary package of VirtualBox 3.2.12 that comes with FreeBSD 8.2, so we had to wait until Internet access was fixed to download 120 MB of software (ports updates and VirtualBox 4.0.8) like this:

pkg_add -r portupgrade
portsnap fetch extract update
portupgrade virtualbox-ose virtualbox-ose-kmod

This took a long time, as VirtualBox is a large piece of software which also required us to download and build a new version of Qt, but eventually it succeeded and the problem was solved.

We decided to put only five virtual machines on each host. Sometimes we would have the whole class compiling software from ports, which would slow down all of them significantly. We will use six or seven servers to host 30-35 virtual machines. On the master host, we created five copies of our master virtual machine by copying its hard disk like this:

cd .VirtualBox/HardDisks
for i in 1 2 3 4 5; do
	cp AfNOG\ SSE\ Master.vdi vm0$i.vdi
	VBoxManage internalcommands sethduuid vm0$i.vdi
done
Moving the systems to the NOC

Relocation

Then we created the virtual machines in the VirtualBox GUI and attached them to these new images. We needed to generate a new UUID for each disk image copy, using the undocumented sethduuid command above, otherwise VirtualBox would refuse to add the copies because it had a disk image already registered with the same UUID.

We could have created the virtual machines using the VBoxManage command as well, but it would have taken longer to work out how to use it than simply to create the five machines by hand. I later worked out the commands that we could have used:

cd ~/"VirtualBox VMs"
for i in {1..5}; do
	echo $i
	vmname=VM0$i
	diskimage="$vmname/FreeBSD.vdi"
	VBoxManage createvm --name "$vmname" --ostype FreeBSD --register
	VBoxManage modifyvm "$vmname" --memory 256 \
		--nic1 bridged --bridgeadapter1 bge0.219 \
		--nic2 bridged --bridgeadapter2 bge0.$((50 + i)) \
		--vram 4 --pae off --audio none --usb on \
		--uart1 0x3f8 4 --uartmode1 server /home/chris/"$vmname"-console.pipe
	VBoxManage storagectl "$vmname" --name "IDE Controller" --add ide
	cp VM01/FreeBSD.vdi "$diskimage"
	VBoxManage internalcommands sethduuid "$diskimage"
	VBoxManage storageattach "$vmname" --storagectl "IDE Controller" \
		--port 0 --device 0 --type hdd --medium "$diskimage"
done

We named the images VM01 to VM05, which was important for running automated scripts on them. Then we configured VirtualBox to start them automatically at boot time, in headless mode, by adding the following lines to /etc/rc.conf:

vboxheadless_enable="YES"
vboxheadless_machines="VM01 VM02 VM03 VM04 VM05"
vboxheadless_user="inst"

We wrote a short script to help us apply the same command to all five virtual machines:

#!/bin/sh
# script to control all five virtual machines

command=$1
shift

for i in 1 2 3 4 5; do
	VBoxManage $command VM0$i "$@"
done

This allows us to log into a machine and do things like:

  • ./manage controlvm acpipowerbutton to initiate a controlled shutdown of all five virtual machines
  • ./manage modifyvm --macaddress1 auto to generate new, random MAC addresses after cloning the host
  • ./manage startvm --type headless to get the virtual machines running again (headlessly, independent of the GUI)
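If you want to check what the script will do before letting it loose on running guests, a dry-run mode is handy. The following is a sketch of the same loop wrapped in a function, with two additions that are my own assumptions and not part of the script above: the VBOXMANAGE and VMS variables, which let you swap in another command and another list of machines.

```shell
#!/bin/sh
# Sketch: the same loop as the manage script, with two assumed additions.
# VBOXMANAGE and VMS are hypothetical overrides, not VirtualBox settings.
: "${VBOXMANAGE:=VBoxManage}"
: "${VMS:=VM01 VM02 VM03 VM04 VM05}"

# manage_all SUBCOMMAND [ARGS...]:
# run "VBoxManage SUBCOMMAND <vm> ARGS..." once for each virtual machine
manage_all() {
	subcommand=$1
	shift
	for vm in $VMS; do
		$VBOXMANAGE "$subcommand" "$vm" "$@"
	done
}
```

With VBOXMANAGE=echo set, manage_all modifyvm --macaddress1 auto just prints the arguments that would be passed for each machine ("modifyvm VM01 --macaddress1 auto" and so on) instead of changing anything. To use it as a drop-in script, add a final line that calls manage_all "$@".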
Room with desks around the edge, covered in computers and equipment

The NOC

We knew that we wouldn’t have space to attach monitors and keyboards to all the hosts, and we’d have to fiddle about with cables in the hot NOC room (without working aircon) if we needed access to their consoles, so we added the ability to log into them remotely using VNC and GDM. To do this, we had to install the VNC server:

pkg_add -r vnc

Which unfortunately doesn’t come with the nice xorg loadable module that adds a built-in VNC server to the X server, making a fast and stateless remote control session possible. So we had to hack about with inetd, first by adding a service name with a port number to /etc/services:

vnc		5900/tcp

And then a service line in /etc/inetd.conf:

vnc	stream	tcp	nowait		root	/usr/local/bin/Xvnc Xvnc -inetd :1 -query localhost -geometry 1024x768 -depth 24 -once -fp /usr/local/lib/X11/fonts/misc/ -securitytypes=none

This requires us to enable the XDMCP protocol in GDM, in order for VNC to communicate with it to present a GDM login screen. So we replaced the contents of /usr/local/etc/gdm/custom.conf with the following:

[xdmcp]
Enable=true

[security]
DisallowTCP=false

And then restarted GDM:

sudo /usr/local/etc/rc.d/gdm restart

And checked that we could connect from another machine and got a login prompt:

vncviewer 196.200.217.128

Which did indeed give us a working login screen:

VNC graphical login on a FreeBSD virtual machine host

This method is very slow. I wanted to find a better way to access the guests, especially if their network configuration was broken. I tried to connect a host serial port to a pipe and then access that pipe from a shell command, eventually over ssh, in a similar way to the text-only console offered by Xen (xm console). The above VBoxManage commands set up a pipe in my home directory, and I wrote the following short script to access it:

#!/bin/sh
set -x
echo "Console for $USER"
nc -U /home/chris/$USER-console.pipe

I created user accounts for each virtual machine, with the same name, and set their shells to this script, so that when they log in, they would automatically be connected to the pipe. However I was unable to make it work well. In particular, because of incompatible terminal emulations, I was unable to run vi to edit configuration files in the guest. If you find a way around this, please let me know. I haven’t tried it yet, but conman looks like it might be a good bet.

I spent a long time searching for the hidden VNC support in VirtualBox 4. It’s undocumented (the manual only talks about RDP) and people on the IRC channel say that it doesn’t exist, but it does, at least when starting the guests in headless mode. I added the following lines to /etc/rc.conf:

vboxheadless_VM01_flags="-n -m 5901"
vboxheadless_VM02_flags="-n -m 5902"
vboxheadless_VM03_flags="-n -m 5903"
vboxheadless_VM04_flags="-n -m 5904"
vboxheadless_VM05_flags="-n -m 5905"

And then, after starting the guests in headless mode, I could connect to these ports and access the virtual displays, much more conveniently and much faster than by shutting down the guests using VBoxManage and starting them again using the VirtualBox GUI.
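With more guests per host, typing these lines by hand gets tedious. Here is a sketch that prints them instead, giving guest N the port 5900 + N. The function name vnc_flag_lines is made up, and the flags are simply copied from the lines above.

```shell
#!/bin/sh
# Sketch: print one vboxheadless_<VM>_flags line per guest, on port 5900 + N.
# vnc_flag_lines is a hypothetical helper; the flags are copied from above.
vnc_flag_lines() {
	count=$1
	i=1
	while [ "$i" -le "$count" ]; do
		printf 'vboxheadless_VM%02d_flags="-n -m %d"\n' "$i" $(( 5900 + i ))
		i=$(( i + 1 ))
	done
}

# Print the five lines used on our hosts
vnc_flag_lines 5
```

Review the output and then append it to /etc/rc.conf. The VM%02d naming only matches our VM01 to VM05 scheme, so adjust it if your machines are named differently.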

We used multicast to image the six virtual machine hosts from the master. This took about three hours, so we left it running overnight.

In the morning we checked that the hosts had been imaged successfully by booting them with their newly installed images, and gave them unique hostnames (host1.sse.ws.afnog.org etc.) and IP addresses.

We used the manage script to reset the MAC addresses of the network cards of each virtual machine on each host:

for i in 128 129 130 131 132 133 134; do ssh 196.200.217.$i ./manage controlvm acpipowerbutton; done
for i in 128 129 130 131 132 133 134; do ssh 196.200.217.$i ./manage modifyvm --macaddress1 auto; done
for i in 128 129 130 131 132 133 134; do ssh 196.200.217.$i ./manage startvm --type headless; done
Michuki Mwangi setting up a projector

Astral projection

Since they were all configured for DHCP, we could have got their IP addresses from the DHCP server, but we wanted to give them a nice naming scheme, so we logged in to each one (using the console and the VirtualBox GUI) and assigned it a unique hostname and a static IP address.

We checked that we could log into each virtual machine remotely using the SSH keys that we’d installed, and then we shut down the hosts and moved them to the NOC.

Boot camp starts tomorrow, next door, but we still have to arrange our room.

Michuki Mwangi surrounded by rows of desks covered with computers

Classroom

We may have up to 37 people, our biggest class ever, in a room that’s about eight metres on a side, so layout of the room is a real challenge.

I wanted to do something to facilitate working in groups, such as each table having four places (two each side) and with its long axis front-to-back. This was vetoed because participants would have to turn their heads to see the projected screen, and it might be hard for them to take notes as a result.

So we’re going to have long, cramped benches instead. I think this is unfortunate, and I hope I can persuade people to try something more imaginative in future.

Apptivate for Development

Friday, May 13th, 2011

As I’m at the Open Data for Development Conference, I thought I should write a quick note about some of our recent openness achievements:

  • SARPAM is a current Re-Action project to share drug price data across Southern Africa, helping countries negotiate a better deal from drug suppliers.
  • IHP+Results is another Re-Action project, just completed, to promote accountability among IHP signatories by sharing information on their health service improvements and progress towards meeting their Millennium Development Goals.

You can find the source code for both these applications on Aptivate’s GitHub. Our intention is to agree with our partners to publish full source code wherever possible.

Offline Websites and Low Bandwidth Simulator in Go

Wednesday, February 16th, 2011

Jon Thompson writes about Jeff Allen’s interesting new work on tools for working with low bandwidth:

Jeff continues to try and solve the low bandwidth/high latency problems that aid workers face in the field every day and that we encountered in Indonesia. We all know the joy of VSAT networks that slow to a crawl because either some folks on the team are downloading stuff they shouldn’t be downloading or all the computers are infected with bandwidth sucking viruses. It appears Jeff has moved one step closer to sorting out some of the problems surrounding bandwidth optimization by utilizing the Go programming language.

Rather than try and explain to you what Jeff has done I’ll let you read ‘A rate-limiting HTTP proxy in Go’ and ‘How to control your HTTP transactions in Go’ and sort out what he is talking about. Hopefully, this post will bait Jeff into leaving a lengthy comment that explains exactly what the hell he is up to.

My understanding is that Jeff is developing two useful tools:

People have been trying to make offlineable websites for a long time. Some of the best examples so far are using entirely client-side (in-browser) technology, such as the Logistics Operational Guide, developed by the World Food Programme for the Logistics Cluster, which can run entirely offline using Google Gears.

Gears had a lot of potential for developers to create offlineable websites, but Google has abandoned its future development in favour of the open standard HTML5, which is not ready yet. So there’s no obvious and future-proof way to develop offlineable websites at the moment. Jeff’s proxy, combined with a spidering system, could be one way to download an entire site, even if it wasn’t designed to be downloaded by the developers.

Another important potential comes from content management systems (CMS) such as WordPress, Drupal and Joomla. More and more websites are developed using such systems, rather than coded from scratch. The systems know all of the pages on the site, and the links between them, and could easily build an offlineable version of the site for download into Gears, HTML5 or Jeff’s proxy. And one plugin could potentially enable thousands of sites to be offlineable, especially if it was included in the CMS distribution and enabled by default.

A few wikis such as MediaWiki, MoinMoin, DokuWiki and JSPWiki have a programming interface (XML-RPC or WebDAV) that allows a smart client to download pages in their original text format, which could make them more efficient to store offline and also potentially editable offline. Jeff’s proxy could be extended to support sites built in such wikis automatically. There are still some limitations to this approach:

  • The pages would not look the same as the online versions, since the styling wouldn’t be downloaded and the effects of CMS plugins would not be visible;
  • It would probably still be quite slow to download an entire site this way, by spidering, without server-side support for downloading multiple pages at once;
  • Few websites are built out of wikis, so the potential maximum reach is limited compared to better support for WordPress, Drupal or Joomla.

Anyway, I wish I knew Go, and had time to hack on Jeff’s proxy tools.