Archive for the ‘Linux’ Category

Content indexing in Django using Apache Tika

Wednesday, February 1st, 2012

For the Documents module of our new open-source Generic Intranet, we need to be able to extract the text content and metadata from various kinds of documents:

  • PDF files
  • Microsoft Office DOC, XLS and PPT files
  • and the new XML equivalents, DOCX, XLSX and PPTX.

I found various tools online to help extract this text, largely thanks to two Stack Overflow answers, and ended up with a hodgepodge of tools.

There were a number of problems with this hodgepodge:

  • I was unable to find any Python or command-line solution for old Excel (XLS) files;
  • These solutions did not extract metadata, only document text;
  • The choice of which tool to use depends on the MIME type returned by the file(1) command, which varies depending on the OS (Debian/Ubuntu or CentOS) and which version of the library is installed.

Another Stack Overflow post recommended Apache Tika for metadata extraction. It appears to support all the document formats that we need, and to have auto-detection of the document format, which solves all the MIME type problems as well. However, it introduces a new problem: it’s written in Java, which is hard to access from Python.

Luckily I found some instructions for building a Python wrapper around Tika, using some tools that I’d never heard of, and this seemed like a good approach. Unfortunately the installation process is very non-standard, which would not fit in with our fabric-based automated deployment process, and would make it harder for users to install the Intranet themselves.

The instructions are somewhat outdated at the time of writing, as they refer to Tika version 0.7, while 1.0 has been released. I was unable to register for an account to update that page, so I wrote to the author with the details that I discovered, and will also document here that the following command works for me:

python ../jcc/jcc/__main__.py \
        --include /usr/share/java/org.eclipse.osgi.jar \
        --jar tika-parsers-1.0.jar \
        --jar tika-core-1.0.jar \
        java.io.File java.io.FileInputStream \
        java.io.StringBufferInputStream \
        --package org.xml.sax \
        --include tika-app-1.0.jar \
        --python tika --version 1.0 --reserved asm

I was able to go further than this, and package Tika in a way that makes it easy to install with Pip, and thus integrate with our deployment process.

The wrapper is written using JCC, which works by generating and compiling C++ code that links to the Java classes, and then a Python wrapper around that C++. This means that it needs to be recompiled for each platform, so I couldn’t just distribute a binary blob with the Intranet (I had the same problem with DocToText above).

The version of setuptools on our servers doesn’t support JCC’s shared library mode. JCC dies with an error unless shared mode is explicitly disabled or setuptools has been patched. I couldn’t do either of these as part of our standard deployment process, so I patched JCC to disable shared mode, since we don’t need it anyway. I also added some patches to allow various setup.py commands used by pip to be forwarded through JCC to the setup function call.

This seems to be enough to allow you to install JCC like this:

pip install git+git://github.com/aptivate/jcc.git

I also wrote a setup.py file that handles pip’s command line invocations and passes the necessary options to JCC, and JCC’s invocation of the setup function. This seems to be enough to install the package using pip:

pip install git+git://github.com/aptivate/python-tika.git

and you can use the last parameter as a package specification in pip_packages.txt, or whatever you pass to pip -r.

You can find the pip-installable Tika package, complete with Tika 1.0 JAR files, in our python-tika repository on GitHub. This will save you the work of downloading and compiling Tika and all of its dependencies. I have started a discussion with the JCC developers about merging these changes into the upstream project.

Ubuntu Laptops in Schools

Thursday, September 1st, 2011

I’m currently working on a project that’s putting computers into Zambian schools to try to revolutionise education, making it more fun and interactive for kids, and reducing the problems of teacher absence.

They’re using Intel Classmate style PCs, currently running Windows 7 Home Starter. I’m investigating whether Ubuntu would provide a better experience. It might be faster, more reliable, more manageable and easier to lock down than Windows.

Ubuntu 10.10 (Maverick) doesn’t boot on these computers, probably due to problems with the HPET. I don’t like Unity so I don’t want to try 11.04 just yet, which left me falling back to 10.04 (Lucid) with long-term support.

Automatic Logins

The computers should be in a kiosk-like mode for student use, where no login is required but they are locked down. They should also be used by teachers (with a password and fewer restrictions) and administrators (with another password and no restrictions). So I created three user accounts. Student is set to log in by default with no password.

While this works, there are other places where a password is requested and none works, because the Student account doesn’t have a valid password:

  • unlocking from screensaver
  • switching users
  • sudo from the command line

The last one is less important because students should not be able to access the command line anyway, or have any administrative rights. But they need to unlock the screensaver and be able to switch users.

We solved the screensaver problem by telling the screensaver not to lock the screen for this user, just as we did for Camfed in the Zambia SRC with LTSP:

# Disable locking the screen for users with no password to unlock it
sudo -u student gconftool-2 \
	--type boolean \
	--set /apps/gnome-screensaver/lock_enabled false

However the user switching was more tricky. Luckily I found a very helpful question and answer on SuperUser. I improved on it slightly by reusing Ubuntu’s builtin nopasswdlogin group, so that users who can log in with no password can also be switched to with no password.

To achieve this, just add the following line at the beginning of /etc/pam.d/gnome-screensaver:

auth sufficient pam_succeed_if.so user ingroup nopasswdlogin

Firefox Kiosk Mode

We want the browser to be fullscreen all the time, so we need to use some extensions:

  • Full Fullscreen to make it start in fullscreen mode;
  • Keyconfig to stop them exiting full screen mode with F11, or closing the browser with Alt-F4.

We also change some preferences using about:config:

xpinstall.enabled: false
to prevent installing more extensions;
app.update.auto: false
to stop Firefox checking for updates by itself;
browser.sessionstore.resume_from_crash: false
to prevent the Restore previous session prompt when starting Firefox;
extensions.update.enabled: false
to stop Firefox checking for updates to its installed extensions;
extensions.update.notifyUser: false
to avoid a prompt if an extension update is discovered;
browser.tabs.warnOnClose: false
to avoid the prompt to save your tabs on browser exit;
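
The same preferences can be set without clicking through about:config on every laptop, because Firefox reads a user.js file from the profile directory at startup. Here is a minimal sketch in Python that renders the settings listed above as user.js lines. The helper name render_user_js is made up for illustration, and the script only prints the lines; where to write them depends on the profile path on your machines, which is not covered here.

```python
# Preferences from the list above; every one of them is switched off.
KIOSK_PREFS = {
    "xpinstall.enabled": False,
    "app.update.auto": False,
    "browser.sessionstore.resume_from_crash": False,
    "extensions.update.enabled": False,
    "extensions.update.notifyUser": False,
    "browser.tabs.warnOnClose": False,
}


def render_user_js(prefs):
    """Render a dict of preferences as user.js lines (hypothetical helper)."""
    lines = []
    for name, value in prefs.items():
        if isinstance(value, bool):
            # JavaScript booleans are lower case.
            rendered = "true" if value else "false"
        elif isinstance(value, int):
            rendered = str(value)
        else:
            # Quote strings, escaping backslashes and double quotes.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = '"' + escaped + '"'
        lines.append('user_pref("%s", %s);' % (name, rendered))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_user_js(KIOSK_PREFS), end="")
```

Redirecting the output into user.js in the student’s Firefox profile should give the same result as making the changes by hand, but please check it on one machine before relying on it.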

Window Manager

We want the students to have access to a restricted set of applications. The user interface also needs to be unbreakable (child-proof). Windows should always be maximised, as the laptops have quite small screens. All of this points to using a custom window manager/desktop instead of the standard Gnome or KDE.

Fluxbox and Openbox were recommended, but they seem to be aimed at highly-customised desktop environments (for geeks) rather than locked-down kiosks or embedded systems. Matchbox looks like quite a good fit. It has a very simple front menu and an everything-maximised window manager, which sounds great for ease of use.

We’re using GDM for the user login, which offers users a choice of which session (window manager) to run. This is OK, and even quite good for administrators, as it provides a failsafe option in case the usual window manager is borked. But I can’t see how to disable or override this for particular users. Students have no-password logins, so they don’t even get the opportunity to choose a window manager.

The DefaultSession in /etc/gdm/custom.conf (chosen using gdmsetup) changes their window manager, but affects all users, and we don’t want to force everyone to use the restrictive kiosk window manager.

I found that GDM lets you specify your own Xsession script, which gdm uses to actually start the session selected by the user. So I wrote a replacement:

#!/bin/sh

if [ "$USER" = "student" ]; then
	/etc/gdm/Xsession /usr/bin/matchbox-session
else
	/etc/gdm/Xsession "$@"
fi

All it does is call the original Xsession, overriding the name of the session manager if the current user is the special student user, and otherwise behaves exactly as normal.

Save it in /usr/local/bin/GdmKioskSession, make it executable, and add the following line to /etc/gdm/custom.conf:

BaseXSession=/usr/local/bin/GdmKioskSession

If you don’t even want the application menu, but want to force a particular application such as a web browser (true kiosk mode), replace /usr/bin/matchbox-session with /usr/local/bin/kiosk-session, create that file with the following contents and make it executable:

#!/bin/sh
matchbox-window-manager -use_titlebar no &
exec /usr/bin/chromium-browser -kiosk -app=http://staging.ischool.zm/

More lockdown tips to follow.

Move over Microsoft: Design’s going open source

Thursday, May 19th, 2011

I’ve been designing websites since 1999 but switching from Microsoft Windows to Ubuntu has been one of those pivotal experiences worth sharing. Joining Aptivate as the in-house designer recently has given me the opportunity to challenge some pretty old work-flows and move towards a totally open source design practice.

Aptivate work almost exclusively with open source software so it seems a great idea to give Microsoft the push, and frankly I’d had enough of waiting 9 minutes for my laptop to reboot. That’s enough time to make 6 people a cup of tea, water the plants and rearrange the desk; all good things ONCE a day, not 5 times when the PC crashes. 3 years of filling up with junk makes a Windows PC very, very sluggish, and makes for an unhappy designer with a tidy workspace.

Changing over – a gradual process

Leaving Microsoft was never going to be a straight switch. Leaving a web platform if you are dependent on it for your income is a scary thing.

Since I cut my design teeth on Apple Macs and Adobe software in 1995, I have moved gradually to Windows in a bid to be one step closer to the end users, who on the whole use this platform. I was still, however, heavily dependent on Adobe products, primarily Fireworks and Dreamweaver for interface design and development, but also Illustrator for vector graphics and Photoshop for photo editing. I know they are good, but really? Couldn’t I achieve a good web result without them?

Open source alternatives within Windows.

Going back a bit to before the anger towards the laptop really kicked in. Finding open source alternatives was an exciting challenge. These were readily available for Windows so I didn’t have to switch, just try them out. I found a really useful website www.osalt.com which helps you to find open source alternatives to commercial software.

Web development – Aptana 2

Essentially I wanted an html/css editor with code completion, ftp client and project manager. I tried Amaya, Bluefish, KompoZer and Mozilla SeaMonkey, which all had great features, but none of them did ALL the things I wanted together and I really was a bit spoilt by having them all rolled into one with Dreamweaver. Finally I found Aptana 2 which, whilst a bit fiddly to get started with, seemed to have everything I needed, hurray!

  • Best thing about Aptana – has to be the available plugin support for different types of project such as php, python/django, javascript, svn and git. It’s very comprehensive.
  • What I miss the most from Dreamweaver – nothing… had I still been dependent on the WYSIWYG editor in Dreamweaver then I could argue that this would be a big deal, but since most of my work is now with dynamic database driven sites, I tend to use Firebug to make visual tweaks before committing them to code.

Web design – Inkscape

A key part of my work involves developing brand assets, icons and other vector graphics for both web and print design. For this I had depended on Illustrator. Quite quickly however I discovered Inkscape. What an amazing product! It’s a bit clunky but it lets me do 90% of the key design tasks I did in Illustrator and 60% of tasks I used to do in Fireworks. Essentially it is the easiest transition I have made on this journey and means that I now only use Inkscape for designing for the web.

  • Best thing about Inkscape – The ability to export 32 bit png files from any selection, the page or object. The other best bit is all the native SVG features I have yet to discover!
  • What I miss the most from Illustrator – nothing worth mentioning.
  • What I miss the most from Fireworks – Image optimisation – no image editing software that I’m aware of does a better job of compressing and optimising all manner of image files than Fireworks. It creates an 8 bit alpha png with a tiny size and smooth edging where transparent bits kick in. This for me is the single most important missing feature of any open source alternative. Inkscape needs a good image optimising tool.

Photo editing – Gimp

Scaling, cropping, optimising photos; that’s a big part of creating photographic content for a website. Mainly I leave that to the end user but sometimes I create photo-based graphics too.

Gimp seems the most logical choice, as it suggests itself, but I’m not finding it intuitive and it seems to crash more often than not. Fireworks has a limited range of bitmap editing tools compared to Photoshop but integrates really well within the context of creating an interface design, as bitmap and vector graphic objects live happily on the same layer or multiple layers. Now there are several open source alternatives for this work, but I’ll admit that I’m not finding a great solution and again it’s the optimisation issue that makes me frustrated.

Designing for Low Bandwidth

If we need to create websites with small file sizes for countries with low bandwidth then we need powerful optimisers. I found that Fireworks was great because of the high compression rates achievable which result in files more than half the size achievable using other programmes, open source or commercial. So here is a proposal for a future project for Aptivate – create an 8 bit Alpha transparency image optimiser that challenges Fireworks.

Switching to Ubuntu

A couple of months later I reformatted and partitioned the hard drive of my Dell Precision 9300 workstation. I still wanted access to Windows, so did a dual boot with an extra partition for shared files. Apart from my video card packing in unexpectedly (nothing to do with the installation, I’m assured), the installation went without a hitch. I was amazed by how intuitive the Ubuntu interface was. Using the software install centre, I was able to quickly and directly install all the applications I needed. Start up and shut down took mere seconds, and I was in production mode within a couple of hours. I was more than happy to find Dropbox, Skype and Acrobat Reader were also available. All in all it was pain free, and… Ubuntu is actually graphically beautiful (not something I had anticipated at all).

The dreaded Terminal

Well it’s inevitable even for a designer. The WYSIWYG addict’s worst fear, THE TERMINAL!!! agghh!! A baptism of code fire at Aptivate, no namby pamby intro here. I had pulled down the designer’s defence gate and in trickled (and sometimes poured) an endless stream of code, configuration files, settings, database fixtures, tables, smart tricks and speedy scripts, screen sharing and networking wizardry. Ah! Then I had a cup of tea. I’ll be a terminal ninja one day.

Do I miss Windows?

I don’t miss Windows. I booted into it shortly after installing Ubuntu, and found it an empty experience, like going back to a house you’ve just moved out of but still have keys to. Can’t find the kettle to make a cup of tea so not staying. I recently installed VirtualBox and now run several different versions of Windows XP to help me cross-browser debug CSS.

Moving forward

Apart from using my new open source toolbox to help Aptivate refresh its current website, there are things I want to do. It would be great to contribute to the open source community and generally help to improve already great products such as Inkscape. I’ve already started submitting Inkscape icons to the Open Clip Art Library and it would be great if we developed the image optimiser I mentioned before. There is also a lot of potential in interactive SVGs, which would be great to explore, especially since Internet Explorer 9 supports them… oh Microsoft, you are never forgotten…

If you are a designer reading this, and fancy getting involved with Aptivate’s open source efforts somehow, get in touch. If you are a developer and have a secret optimiser up your sleeve, please let us feed you and keep you amused because we want it!!!

Digital Photography on Linux

Thursday, September 2nd, 2010

Many people think that digital photography on Linux is much harder than on Windows or Macintosh. This is a typical comment:

As a hobby I’m a photographer, so I do lots of graphic work as editing for invites, posters, mag covers etc. does any one know of a software that will be as great as photoshop for linux (Ubuntu), not Gimp, Gimp is very limited. I tried using WINE but it doesnt work well, very slow.

It was true for a long time that working with digital photos on Linux was difficult. The main problem was that the most popular software that users are familiar with, Photoshop and Lightroom, is simply not available for Linux (due to unfortunate business decisions by Adobe).

There have been open source alternatives to Photoshop for a while, notably Gimp, but Gimp is difficult to use and the interface is quite annoying:

I’d say the major limiting factor is the desire to slit your wrists while arguing with the unwieldy interface layout.

Luckily today I came across (thanks to Alastair Otter of GLUG) this excellent article:

Photography with Open Source / Linux

This includes detailed descriptions of the many applications available, most of which I hadn’t heard of before, and look forward to trying. I’ve been particularly impressed by Adobe Lightroom’s ease of use and powerful histogram correction features, and I was considering buying a copy even though it doesn’t run on Linux, but I will definitely try Darktable now.

For users who really, really have to have a Photoshop user interface, GimpShop may help.

And before you say “this is not work related”, I’ve often needed to edit photographs to prepare brochures, promotional materials and for website work :-)

It seems that digital photography on Linux has made excellent progress in the last year, and I’m very happy to see it.

System Imaging for Free using G4L

Thursday, July 22nd, 2010

This is a copy of the notes that I wrote at AfNOG 2010 as a guide to using system imaging at future workshops. Unfortunately that wiki is not accessible without signing up for an account, so I’m posting the information here too.

How to Install Computer Labs

If you ever need to set up a large number of computers in identical configurations, you have a few options:

  • Install each one individually by hand
  • Automate the standard install process
  • Configure one machine exactly how you like it, and then exactly duplicate the hard disk to the others (disk imaging)

The first option (manual installation) is extremely slow, tedious, error-prone, unlikely to result in identical machines, and does not speed up future installations or reinstallations.

The second option requires using rarely-used and less tested parts of the installer, slows down badly with multiple simultaneous installations (due to limited network bandwidth and bugs in the inetd TFTP server), and places limits on what you can customise. For example, it seems impossible to customise /etc/rc.conf using the installer on FreeBSD, and pre-installing SSH keys is tricky. I spent days writing a sysinstall script to automate the process. It would have taken just half an hour to set one machine up perfectly by hand, and then copy the system image onto all the other PCs in a few unattended hours.

Therefore I prefer the third option, system imaging.

What is System Imaging

Imaging is the process of making exact copies of one machine’s hard disk, including all partitions, onto another. This only works when the second hard disk is at least as large as the first. It works best when all the PCs are identical.

Imaging is independent of the operating system. You can image Windows, FreeBSD, any version of Linux, dual-boot and triple-boot installations, whatever you like.

We successfully used imaging to set up the PCs for these workshops:

How to Image

Many systems administrators have heard of Norton Ghost and Acronis True Image, two of the most popular commercial applications.

However, open source alternatives such as G4L (Linux-based) and its ancestor G4U (FreeBSD-based) are pretty good, and completely free. G4L however lacks a website, and it’s not obvious how best to use it, hence this post.

G4L is quite similar to G4U, and I could have used G4U instead. But I find the Linux kernel’s hardware support a bit better than FreeBSD’s, and G4L supports multicasting, which enables it to install many machines at the same time with good performance.

Using Ghost for Linux (G4L)

I’ve successfully used Ghost 4 Linux (G4L) versions 0.27 and 0.33 for this process. 0.33 has multicast support, which allows setting up an entire room in one go, without wasting network bandwidth copying the same 4 GB disk image to each of 50 machines independently.
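
To put a number on the bandwidth saving, here is a back-of-the-envelope sketch in Python using the figures above (a 4 GB image and 50 machines). The function name is made up for illustration, and the calculation ignores compression and any retransmissions, so treat the results as rough upper bounds rather than measurements.

```python
def total_transfer_gb(image_gb, machines, multicast):
    """Data the master has to send over the network.

    With multicast the image is sent once for the whole room;
    without it, the image is sent once per machine.
    """
    return image_gb if multicast else image_gb * machines


unicast = total_transfer_gb(4, 50, multicast=False)
multicast = total_transfer_gb(4, 50, multicast=True)
print("one copy per machine: %d GB, multicast: %d GB" % (unicast, multicast))
```

In other words, copying to each machine independently would push about 200 GB through the network, against roughly 4 GB for a single multicast session.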

Set up an FTP server on a local server on your network, with an account that supports downloads and uploads. Make sure it has plenty of disk space free, perhaps 40 GB. Create an “img” directory under the FTP user’s home directory for the images.

Download G4L and burn some CDs, maybe about five copies, or set up network booting (this conflicts with FreeBSD PXE installation and may require BIOS setup changes to enable PXE).

It’s a good idea to explore G4L and get used to the options, but please be very careful, as it has the potential to wipe your hard disk! So please use a machine with a fresh hard disk or which you don’t mind wiping.

To boot into G4L (you will need to do this several times below, but not yet, unless you just want to explore):

  • Reboot or power up the machine
  • Press the key to choose boot device
  • If CD-ROM is not on the list, reboot, go into the BIOS and enable booting from CD-ROM
  • Choose to boot from the CD
  • Choose the default kernel at the GRUB screen (just press Enter)
    • If for some reason the default kernel doesn’t work, the machine hangs or crashes or doesn’t detect the network interface, then try one or two other kernels
  • Wait for the kernel and initrd to be loaded (two long lines of dots)
  • Then you can remove the CD, about one minute from cold boot, and start booting another PC
  • Press space to skip each of the information/advertising screens (about 8 of them)
  • Enter g4l at the prompt (if you go past this and get a shell, just type g4l at the shell prompt)
  • You can access other consoles with Ctrl-Alt-F1 to F4, log in as g4l with no password, and run g4l, ifconfig, ping or whatever
  • Choose Network Use (default)
  • Choose Raw Mode (default)
  • Check that you have an IP address (option B) or try again to acquire one by DHCP
  • If you can’t get an IP address by DHCP, check your cabling and DHCP server

Create a Restore Image (optional)

Back up one of your PCs if necessary (if you plan to restore the PCs later) by:

  • Follow the procedure above to get into Ghost for Linux
  • Enter the FTP server’s IP address, username and password
  • Choose an image name, e.g. backup_original_2010_07_22.img
  • Choose the back up option
  • Press Space to select the entire disk (mark it with an asterisk [*])
  • Start backing up the image

This process can take 1-2 hours. In the meantime…

Set up the Master PC

If you don’t already have a master computer set up, it’s a good idea to WIPE THE DISK first. This makes the image much smaller, and transfer much faster. Please DO NOT do this if you have anything valuable on the master computer, for example an existing operating system installation that you want to keep.

Boot G4L on the PC that you will use as the master. Use DD to wipe the entire disk with zeroes:

dd if=/dev/zero of=/dev/sda bs=1M

Install FreeBSD or whatever operating system(s) on the master PC, and set it up exactly the way you want all of the PCs to be. Examples include:

  • Install Gnome (gnome/gnome2)
  • Install Xorg (x11/xorg)
  • Install Firefox (www/firefox35)
  • Install Xpdf (print/xpdf)
  • Enable gnome and sshd in /etc/rc.conf, and add templates for the IP address configuration (this saves typing when setting all the machines to static IPs):
    hostname="pc01.sse.ws.afnog.org"
    ifconfig_bge0="dhcp"
    # ifconfig_bge0="196.200.219.101/24"
    defaultrouter="196.200.219.254"
    gnome_enable="YES"
    sshd_enable="YES"
    
  • Create a user account (e.g. username afnog, password afnog)
  • Log into Gnome, add firefox, terminal and the Downloads folder to your toolbar, and remove epiphany and evolution
  • Edit /etc/fstab and add the proc filesystem:
    proc /proc procfs rw 0 0
    

    (this allows GDM to display the user list and shut down and restart the machine)

  • Edit /etc/profile and set the default pager to less by adding:
    PAGER=less; export PAGER
    
  • Set the timezone by softlinking /etc/localtime to something like /usr/share/zoneinfo/Africa/Kigali
  • Create /etc/rc.local and have it run /usr/sbin/ntpd -qg to set the time once at boot

I recommend using DHCP on this machine. Otherwise all the imaged machines will boot up with the same IP address, causing IP address conflicts, and you will have to reconfigure them before you can access the Internet at all, or reconfigure them automatically.
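
If you do switch the clones to static addresses later, the per-machine lines can be generated rather than typed. The sketch below, in Python, assumes a numbering scheme that I am inferring from the template above, where pc01 pairs with 196.200.219.101: machine number N gets hostname pcNN and the address ending in 100 + N. The function name and the scheme are illustrative assumptions, and the output simply copies the format of the commented-out template line, so adjust both to your own network.

```python
def static_rc_conf(number, domain="sse.ws.afnog.org",
                   network="196.200.219", interface="bge0"):
    """Return the rc.conf lines for one machine (hypothetical scheme).

    Assumes machine N is called pcNN and uses address <network>.(100 + N),
    as suggested by the pc01 / .101 pair in the template.
    """
    if not 1 <= number <= 99:
        raise ValueError("machine number must be between 1 and 99")
    return [
        'hostname="pc%02d.%s"' % (number, domain),
        'ifconfig_%s="%s.%d/24"' % (interface, network, 100 + number),
    ]


if __name__ == "__main__":
    # Print the lines for the first three machines.
    for n in range(1, 4):
        print("\n".join(static_rc_conf(n)))
        print()
```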

Create some SSH keys for use in administering the machines. You may wish to set up the local server already and generate the keys there for security. I recommend adding the keys to /root/.ssh/authorized_keys. Please test that they work, and that sshd comes up automatically after boot!

Imaging the other PCs

On all the PCs (master and clones):

  • Boot G4L as above
  • Check that it has an IP address (option B)

Once a master is online, all the PCs will show “press any key to start”. Pressing any key on any computer will start all the machines imaging. If any PCs are not ready yet, you will have to cancel the imaging process on all of them and start again, or image those PCs later. So:

Start the master last! (when all the other PCs are ready)

Start the clones first, by following these steps on each one:

  • Choose UDP Multicast Client (option U)
  • Select the entire disk, /dev/sda with the space key
  • Say yes, you’re sure
  • When it says “Compressed UDP receiver”, it’s ready and waiting for a master to appear on the network

Then start the master:

  • Get ALL the clones ready, as above, before doing this!
  • On the master, choose UDP Multicast Server (option W)
  • Select the entire disk, /dev/sda, with the space key
  • Leave the options blank
  • Say yes, you’re sure
  • The master starts accepting connections from clients, which will happen automatically. The screens on the clients will also change.
  • Please check that every client says “Press any key to start”.
  • If not, please check it for network problems, etc.
  • DO NOT stop or kill the server now, unless you want to visit every client again!
  • You can press Ctrl+C on the client and run g4l again to check the IP address, retry DHCP, and try the UDP Multicast Client option again.
  • This is your last chance to join any remaining clients to the group for this imaging session!
  • When all the clients are ready, press a key on the master to start transfer.

The master will show progress of the transfer, and an error line if any clients fail to respond. Clients that cause too many errors will be kicked out of the group and appear to “finish” early.

It’s difficult to tell if the imaging process finished successfully or failed on the clients. However it appears that FreeBSD is very good at detecting filesystem corruption, and will fail to boot if the image was not completely transferred. So you can test them by trying to boot FreeBSD and seeing if it boots completely or stops with a filesystem error. Ideally this would be improved in future versions of G4L.

Ubuntu in Zambia

Friday, May 7th, 2010

Ubuntu in Zambia was the title of my talk at OggCamp10: I described our recent work using Ubuntu-based low-power computers for training in rural Zambia. Telling this story to the geeks at OggCamp reminded me of the role the Open Source desktop operating system played in this successful project.

SSH Port Forwarding

Wednesday, March 10th, 2010

David Sumbler wrote to the LinuxChix mailing list:

She now has two computers connected via an ADSL router. Both computers run Ubuntu (8.06 and 9.10). I have set things up so that I can log into the router, and also SSH to both computers simultaneously: I use two different port numbers…

I now want to be able to see her desktops, but I haven’t figured out how to do this. Having read the Gnome help, I believe that the Gnome remote desktop is inherently insecure: I would prefer to tunnel things over SSH, probably using vncserver and vncviewer (or perhaps Vinagre).

Can anybody explain what I need to do to get this to work, please?

I get asked this kind of question so often that I thought I’d write it up somewhere so I could just point people to the post.

SSH port forwarding is not hard to do, once you get your head around how it actually works. Thanks to Alan for drawing this simple diagram:

SSH port forwarding is not like a VPN and it’s not magic. It’s quite like a proxy server:

  • You tell SSH, with the -L option, to listen for connections on a port on your local side.
  • SSH connects to the remote host immediately as usual, and then starts listening on this port.
  • When it receives a connection on this port, it tells the other side (the SSH server that you connected to) to connect to the remote hostname and port that you specified.
  • If the remote side succeeds, the two SSH processes join the two sides together, forwarding bytes from each side to the other.

(Note: it’s also possible to ask the remote SSH server to listen on a port on its side, with the -R option, and connect to a host and port on the client side, but in the interests of simplicity I will ignore that for today.)

I’ll show you the commands that I suggested to David, and then explain what they do:

ssh username@ip-address-of-ssh-server -p port1 -L 5901:localhost:5900
ssh username@ip-address-of-ssh-server -p port2 -L 5902:localhost:5900
vncviewer localhost:1   # connects to computer 1
vncviewer localhost:2   # connects to computer 2

This opens two SSH connections, one to each of the machines behind his firewall, which are completely independent of each other. One SSH connection would actually be enough, as we will see in a minute, but this way fits more logically with my explanation.

These commands contain some placeholders that must be adapted to your situation:

username
The user name that you want to connect as. You can omit the name and the @ sign if it’s the same as your logged-in user on the client.
ip-address-of-ssh-server
The IP address or hostname of the SSH server that you want to connect to. In David’s case, he can’t see the SSH server directly, so he needs to use the public IP address of the router here, and the router will forward the port to the SSH server on his internal network.
port1 and port2
David said that he can “SSH to both computers simultaneously [using] two different port numbers”, presumably using port forwarding on his router. These are the two port numbers.
vncviewer localhost:1
This runs the VNC viewer on the client and tells it to connect to VNC display 1, which runs on port 5901 (by definition, VNC ports are display number plus 5900), which we already forwarded to computer 1 using SSH.
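The display-number arithmetic is easy to get wrong by one, so here is a small sketch that works it out and builds the matching argument for -L. The function names vnc_port and local_forward are invented for this example; only the "5900 plus display number" rule comes from VNC itself.

```python
VNC_BASE_PORT = 5900  # VNC display N listens on TCP port 5900 + N


def vnc_port(display):
    """Return the TCP port for a VNC display number."""
    if display < 0:
        raise ValueError("display number must not be negative")
    return VNC_BASE_PORT + display


def local_forward(display, target_host="localhost", target_display=0):
    """Build the argument for ssh -L that maps a local display to a remote one."""
    return f"{vnc_port(display)}:{target_host}:{vnc_port(target_display)}"


print(local_forward(1))  # 5901:localhost:5900
print(local_forward(2))  # 5902:localhost:5900
```

The two printed values are the -L arguments used in the commands above, so vncviewer localhost:1 and vncviewer localhost:2 end up on local ports 5901 and 5902 respectively.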

After running the two ssh commands, the first SSH client will be listening on port 5901 on the machine that you run it on, and the second will be listening on port 5902.

After this, until you disconnect the SSH sessions or kill the clients in some way, whenever you connect to port 5901 on the client, it will tell the computer it’s connected to (computer 1) to connect to localhost port 5900 (that is, to its own VNC server) and then join the connections together, forwarding any data sent in either direction over the tunnel.

This part of the SSH command:

-L 5902:localhost:5900

tells the SSH client to Listen on port 5902 on the client, and when it receives a connection, to ask the other side (the server) to connect to (what it sees as) localhost port 5900, and SSH will forward communications between the two over the SSH tunnel.

Note first of all that we tell vncviewer to connect to localhost, not to the IP of the remote computer (internal or external). That’s because the client side of the SSH port forwarding is listening on localhost port 5901, and not any other IP address or port. If you connect to anything other than localhost port 5901, you will not end up talking to the local SSH client connected to computer 1.

Note secondly that when we created the tunnels, we told the ssh client to connect them to port 5900, also on localhost. This time, localhost is relative to the remote machine (the server), so we are telling it to connect to itself (not back to you). We could also specify any IP address and port that is reachable to the server, which is acting as our proxy in this case. However, we cannot specify an IP or port that is reachable to the client but not to the server, because the server will not be able to connect to it.

Now let’s imagine that we want to be able to VNC to both computers over a single SSH tunnel. We can do this by forwarding two different local ports, one to localhost, and one to the IP address of the other computer, like this:

ssh username@ip-address-of-ssh-server -p port1 -L 5901:localhost:5900 -L 5902:192.168.10.5:5900
vncviewer localhost:1 (connects to computer 1)
vncviewer localhost:2 (connects to computer 2)

This assumes that computer 2 has the internal (RFC1918) IP address 192.168.10.5, and allows connections from computer 1 to its port 5900.

Port forwarding is unlike a VPN in several ways. The client does not end up with routing to the ultimate destination, nor does it need it. This means that it works even if the client and server have different views of the IP space, for example if they are located in subnets that use the same IP range to refer to different machines.

The server does not try to connect to the ultimate destination until the client receives an incoming connection (e.g. from vncviewer in this case). At this point, it may discover that there is nothing listening on the port to which it was told to connect, or that the destination host is down, or the port is blocked by a firewall. The server informs the client of this, but the client has no way to pass this information on to the connection that it received, which it has already accepted. All it can do is close the connection.

This means, for example, that if you were to sit at the server and type vncviewer 192.168.10.5, and that computer was not running VNC, you might get a Connection refused error. However, if you sit at the client and type vncviewer localhost, you will see the connection is opened and immediately closed, as though the VNC process was listening but refused to talk to you for some reason. Do not be fooled into assuming that VNC is running on the destination. With SSH port forwarding, you have no idea.
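One way to tell these cases apart is to look at what happens straight after the connection opens. The sketch below is a hypothetical helper (probe is my own name, not part of any VNC or SSH tool) that reports whether a connection was refused, accepted and then closed without a byte being sent, or answered with a banner. It relies on the fact that a VNC server speaks first, sending a version string that starts with RFB.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify what happens when we connect to host:port.

    Returns "refused", "closed" (accepted, then closed without sending data),
    "banner: ..." (the peer sent something first) or "silent".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"
    with sock:
        try:
            data = sock.recv(64)
        except socket.timeout:
            return "silent"
        except ConnectionResetError:
            return "closed"
        if not data:
            return "closed"
        return "banner: " + data.decode("ascii", "replace").strip()
```

Run against a forwarded port, a result of "closed" suggests that the SSH client accepted the connection but the far side could not reach the destination, while a banner starting with RFB means that a VNC server really is answering at the other end.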

You cannot forward ICMP (pings), UDP sockets (DNS) or any other protocol except TCP using port forwarding, so you will never be able to ping remote hosts using this method alone.

It is currently impossible to add new forwarded ports to an existing connection or to change the ultimate destination host and port, so you must disconnect and reconnect with a new command line instead. This is inconvenient in some cases, especially where you have a long-running process open in the shell. I recommend using ssh -N to open an ssh client that does only port forwarding and not a shell; then open a separate shell if you need one.

The ssh client cannot exit while any connection is open, so if you log out with connections open, it will appear to hang. All open connections will be closed if the ssh client is forcibly killed by a signal or escape character.

If your port forwarding doesn’t appear to be working, check that you don’t have another process listening on the same port. For example, in the VNC case, both Gnome and KDE desktop sharing create a VNC server on the standard port, 5900, so you cannot forward the local port 5900 to anywhere if you have remote desktop access enabled on the client. The easiest solution is to listen on different port numbers, like 5901 and 5902, which correspond to VNC displays 1 and 2 in the command examples above.

Finally, please note that the meaning of commands like these is very different depending on where they are run (on the client or on the server):

vncviewer localhost
vncviewer 192.168.10.5

This is because:

  • The meaning of localhost is different depending on where you run it (on the client or on the server); it always means connecting to the same computer that the command is running on.
  • The meaning of 192.168.10.5 (or any other IP address) similarly depends on where you run it (on the client or on the server); it is always relative to the computers that are reachable from the one running the command.
  • Connections always appear to the recipient to be coming from the computer running the command, so when the client or the server connects to 192.168.10.5, even if that’s the same computer for both, it will see the connections coming from different IP addresses.

Tariq adds that you can also run:

ssh -D 9999 username@ip-address-of-ssh-server

where the -D option tells SSH to create a SOCKS proxy server tunnel. You can then tell your web browser (and other clients with SOCKS support) to use localhost:9999 as a SOCKS proxy server. This will forward all your browsing through the SSH tunnel, which makes it look like you’re in a different location (e.g. to watch iplayer when not in the UK) and protects your unencrypted web browsing from random sniffers on public networks.

pmGraph – Bandwidth Monitoring for Networks

Saturday, February 20th, 2010

Video introducing pmGraph hosted by Vimeo

pmGraph is a free tool we produce to help administrators monitor bandwidth on networks.

Read more about it or watch the video above.

Many thanks to Mark for putting the video together.

Backup Mail Exchangers

Wednesday, January 28th, 2009

On Monday night, the power supply unit (PSU) in the server that hosts our mail server failed at around 2200 GMT. We don’t have physical access to the server out of hours, so I wasn’t able to replace it until about 1045 the next day, so our main email server was down for nearly 13 hours.

We didn’t have a backup MX because:

  • It usually can’t check whether recipients are valid or not, and therefore must accept mail that it can’t deliver;
  • It usually doesn’t have as good antispam checks as the primary, because it’s a hassle to keep it updated;
  • Spammers usually abuse backup MXes to send more spam, including Joe Jobs.

I thought that this was OK because people who send us mail also have mail servers with queues, which should hold the mail until our server comes back up. It’s normal for mail servers to go down sometimes and this should not cause mail to be lost or returned.

However, we had a report that one of our users did not receive a mail addressed to them, and was told by the sender that it had bounced. I saw the bounce message and suspected Exchange, so I decided to check how long Exchange holds messages before bouncing them. Turns out it’s only five hours by default. Most mail servers hold mail for far longer, for example five days, sending a warning message back to the sender after one day.

Bouncing messages looks bad on us. Apart from making our main mail server more reliable :) we need a backup MX to accept mail when the master is down.

However I do still want to minimise the spam problem that this will cause. Therefore I configured our backup MX to only accept mail when the master is down. Otherwise it defers it, which will tell the sender to try sending it to the master (again).

How did I achieve this magic? With a little Exim configuration that took me a day and that I’m quite proud of. I set up a new virtual machine which just has Exim on it, nothing else. I configured it as an Internet host, and to relay for our most important domains. Then I created /etc/exim4/exim4.conf.localmacros with the following contents:

CHECK_RCPT_LOCAL_ACL_FILE=/etc/exim4/exim4.acl.conf
callout_positive_expire = 5m

This allows us to create a file called /etc/exim4/exim4.acl.conf which contains additional ACL (access control list) conditions. The other change, callout_positive_expire, I’ll describe in a minute.

I created /etc/exim4/exim4.acl.conf with the following contents:

# if we know that the primary MX rejects this address, we should too
deny
        ! verify = recipient/callout=30s,defer_ok
        message = Rejected by primary MX

# detect whether the callout is failing, without causing it to
# defer the message. only a warn verb can do this.
warn
        set acl_m_callout_deferred = true
        verify = recipient/callout=30s
        set acl_m_callout_deferred = false

# if the callout did not fail, and the primary mail server is not
# refusing mail for this address, then it's accepting it, so tell
# our client to try again later
defer
        ! condition = $acl_m_callout_deferred
        message = The primary MX is working, please use it

# callout is failing, main server must be failing,
# accept everything
accept
        message = Accepting mail on behalf of primary MX

The first clause, which has a deny verb, does a callout to the recipient. A callout is an Exim feature which makes a test SMTP connection and starts the process of sending a mail, checking that the recipient would be accepted. This is designed to catch and block emails that the main server would reject. Our backup server has no idea what addresses are valid in our domains; only the primary knows that.

The callout response is cached for the default two hours if it returns a negative result (the recipient does not exist on the master) or five minutes (see callout_positive_expire above) if the address does exist. We use a defer_ok condition here so that if we fail to contact the master, we don’t defer the mail immediately, but instead assume that the address is OK and therefore continue to the next clause.

The second clause of the ACL, which has a warn verb, is what took me so long to work out. Normally, if a condition in a statement returns a result of defer, which means that it failed, the server will defer the whole message (tell the sender to come back later). In almost all cases this is the right thing to do, but it’s the exact opposite of what we want here. We want to accept mail if the callout is failing, not defer it, otherwise our backup MX is useless (it stops accepting mail if the primary goes down).

Because this is such an unusual thing to do, there is no configurable option for it in Exim. The only workaround that I found is that there is exactly one way to avoid a deferring condition causing the message to be deferred: a warn verb. The documentation for the warn verb says:

If any condition on a warn statement cannot be completed (that is, there is some sort of defer), the log line specified by log_message is not written… After a defer, no further conditions or modifiers in the warn statement are processed. The incident is logged, and the ACL continues to be processed, from the next statement onwards.

So what we do is:

  1. Set the local variable acl_m_callout_deferred to true;
  2. Try the callout. If it defers (cannot contact the primary server) then we stop processing the rest of the conditions in the warn statement, as described above;
  3. If we get to this point, we know that the callout did not defer, so we set acl_m_callout_deferred to false.

The third clause of the ACL, which has a defer verb, simply checks the variable that we set above. If we get this far then the primary server is not rejecting this address; and if it’s not deferring either, then it must be accepting mail for the address. In that case, we defer the message, telling our SMTP client to try again later, at which point it will hopefully succeed in delivering directly to the primary.

Callout result caching becomes a problem here. If a previous callout had verified that a particular address existed, and that result was cached for the default 24 hours, then the backup MX would keep deferring mail to that address for up to 24 hours, even after the master went down. This is why we changed the positive callout result caching time to 5 minutes earlier.

The fourth clause of the ACL, which has an accept verb, is even simpler. It accepts everything that was not denied or deferred earlier. We can only get this far if the master is not accepting or rejecting mail for that address.
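The way the four clauses fit together can be sketched as a small Python model. This is not Exim code and it ignores callout caching; the function name backup_mx_decision and the three result strings are made up for illustration, and it only mirrors the order in which the ACL statements are evaluated for one recipient.

```python
def backup_mx_decision(callout):
    """Model the four ACL clauses for one recipient.

    callout is what the primary MX said when probed:
    "accept", "reject", or "defer" (primary unreachable or timed out).
    """
    if callout not in ("accept", "reject", "defer"):
        raise ValueError(f"unknown callout result: {callout!r}")

    # Clause 1 (deny): defer_ok makes a deferred callout count as success,
    # so only an explicit rejection by the primary is denied here.
    if callout == "reject":
        return "deny"

    # Clause 2 (warn): assume the callout deferred, then clear the flag
    # only if verification completed without deferring.
    callout_deferred = True
    if callout != "defer":
        callout_deferred = False

    # Clause 3 (defer): the primary is up and accepts this address.
    if not callout_deferred:
        return "defer"

    # Clause 4 (accept): the primary is unreachable, so take the mail.
    return "accept"


for result in ("accept", "reject", "defer"):
    print(result, "->", backup_mx_decision(result))
```

The only input that leads to the backup MX accepting mail is a deferred callout, which is exactly the case where the primary cannot be reached.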

So far the configuration appears to work fine and has blocked 14 spam attempts (abusing the backup MX) in 14 hours.

Offline Wikipedia

Friday, November 21st, 2008

I’m working on making Wikipedia, the (in)famous free encyclopaedia, available offline, for a project in a school in rural Zambia where Internet access will be slow, expensive and unreliable.

What I’m looking for is:

  • Completely offline operation
  • Runs on Linux
  • Reasonable selection of content from English Wikipedia, preferably with some images
  • Looks and feels like the Wikipedia website (e.g. accessed through a browser)
  • Keyword search like the Wikipedia website

Tools that have built-in search engines usually require that you download a pages and articles dump file from Wikipedia (about 3 GB download) and then generate a search index, which can take from half an hour to five days.

For an open source project that seems ideally suited to being used offline, and considering the amount of interest, there are surprisingly few options (already developed). They also took me a long time to find, so I’m collating the information here in the hope that it will help others. Here are my impressions of the solutions that I’ve tried so far, gathered from various sources including makeuseof.com.

The One True Wikipedia, for comparison

MediaWiki (the Wikipedia wiki software) can be downloaded and installed on a computer configured as an AMP server (Apache, MySQL, PHP). You can then import a Wikipedia database dump and use the wiki offline. This is quite a complex process, and importing takes a long time, about 4 hours for the articles themselves (on a 3 GHz P4). Apparently it takes days to build the search index (I’m testing this at the moment). This method does not include any images, as the image dump is apparently 75 GB, and no longer appears to be available, and it displays some odd template codes in the text (shown in red below) which may confuse users.

Mediawiki local installation

Wikipedia Selection for Schools is a static website, created by Wikimedia and SOS Children’s Villages, with a hand-chosen and checked selection of articles from the main Wikipedia, and images, that fit on a DVD or 3 GB of disk space. It’s available for free download using BitTorrent, which is rather slow. Although it looks like Wikipedia, it’s a static website, so while it’s easy to install, it has no search feature. It also has only 5,500 articles compared to the 2 million in Wikipedia itself (just under 0.3%). Another review is on the Speed of Creativity Blog. Older versions are available here. (thanks BBC)

Wikipedia Selection for Schools

Zipedia is a Firefox plugin which loads and indexes a Wikipedia dump file. It requires a different dump file, containing the latest metadata (8 GB) instead of the usual one (3 GB). You can then access Wikipedia offline in your browser by going to a URL such as wikipedia://wiki. It does not support images, and the search feature only searches article titles, not their contents. You can pass the indexed data between users as a Zip file to save time and bandwidth, and you may be able to share this file between multiple users on a computer or a network. (thanks Ghacks.net)

WikiTaxi is a free Windows application which also loads and indexes Wikipedia dump files. It has its own user interface, which displays Wikipedia formatting properly (e.g. tables). It looks very nice, but it’s a shame that it doesn’t run on Linux.

WikiTaxi screenshot (wikitaxi.org)

Moulin Wiki is a project to develop open source offline distributions of Wikipedia content, based on the Kiwix browser. They claim that their 150 MB Arabic version contains an impressive 70,000 articles, and that their 1.5 GB French version contains the entire French Wikipedia, more than 700,000 articles. Unfortunately they have not yet released an English version.

Kiwix itself can be used to read a downloaded dump file, thereby giving access to the whole English Wikipedia via the 3 GB download. It runs on Linux only (as far as I know) and the user interface is a customised version of the Firefox browser. Unfortunately I could not get it to build on Ubuntu Hardy due to an incompatible change in Xulrunner. (Kiwix developers told me that a new version would be released before the end of November 2008, but I wasn’t able to test it yet).

Kiwix (and probably MoulinWiki)

Wikipedia Dump Reader is a KDE application which browses Wikipedia dump files. It generates an index on the first run, which took 5 hours on a 3 GHz P4, and you can’t use it until it’s finished. It doesn’t require extracting or uncompressing the dump file, so it’s efficient on disk space, and you can copy or share the index between computers. The display is in plain text, so it looks nothing like Wikipedia, and it includes some odd system codes in the output which could confuse users.

Wikipedia Dump Reader

Thanassis Tsiodras has created a set of scripts to extract Wikipedia article titles from the compressed dump, index them, and parse and display the articles with a search engine. It’s a clever hack, but the user interface is quite rough, it doesn’t always work, it requires about twice the dump file size in additional data, and it was a pain to figure out how to use it and get it working. It looks nothing like Wikipedia, although it looks better than the Dump Reader above.

Thanassis Tsiodras' Fast Wiki with Search

Pocket Wikipedia is designed for PDAs, but apparently runs on Linux and Windows as well. The interface looks a bit rough, and I haven’t tested the keyword search yet. It doesn’t say exactly how many articles it contains, but my guess is that it’s about 3% of Wikipedia. Unfortunately it’s closed source, and as it comes from Romania, I don’t trust it enough to run it. (thanks makeuseof.com)

Pocket Wikipedia on Linux (makeuseof.com)

Wikislice allows users to download part of Wikipedia and view it using the free Webaroo client. Unfortunately this client appears only to work on Windows. (thanks makeuseof.com)

WikiSlice (makeuseof.com)

Encyclopodia puts the open source project on an iPod, but I want to use it on Linux.

Encyclopodia

It appears that if you need search and Linux compatibility, then running a real Wikipedia (MediaWiki) server is probably the best option, despite the time taken.