Offline Wikipedia
I'm working on making Wikipedia, the (in)famous free encyclopaedia, available offline, for a project in a school in rural Zambia where Internet access will be slow, expensive and unreliable.
What I'm looking for is:
- Completely offline operation
- Runs on Linux
- Reasonable selection of content from English Wikipedia, preferably with some images
- Looks and feels like the Wikipedia website (e.g. accessed through a browser)
- Keyword search like the Wikipedia website
For an open source project that seems ideally suited to offline use, and considering the amount of interest, there are surprisingly few ready-made options. They also took me a long time to find, so I'm collating the information here in the hope that it will help others. Here are my impressions of the solutions I've tried so far, gathered from various sources including makeuseof.com.
MediaWiki (the wiki software behind Wikipedia) can be downloaded and installed on a computer configured as an AMP server (Apache, MySQL, PHP). You can then import a Wikipedia database dump and use the wiki offline. This is quite a complex process, and importing takes a long time: about 4 hours for the articles themselves on a 3 GHz P4, and apparently days to build the search index (I'm testing this at the moment). This method does not include any images, as the image dump is apparently 75 GB and no longer appears to be available, and it displays some odd template codes in the text (shown in red below) which may confuse users.
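For reference, the import boils down to streaming the compressed dump into MediaWiki's maintenance scripts and then rebuilding the search index. Here is a minimal sketch of those steps in Python (it simply shells out to the same commands you would run by hand); the install path and dump filename are assumptions and will differ on your own server.

```python
#!/usr/bin/env python3
# Minimal sketch of the dump-import steps described above, assuming MediaWiki
# is already installed and configured under /var/www/mediawiki and that the
# pages-articles dump has been downloaded from dumps.wikimedia.org.
# Paths and filenames are illustrative only.
import subprocess

MEDIAWIKI_DIR = "/var/www/mediawiki"                      # assumed install path
DUMP = "/srv/dumps/enwiki-latest-pages-articles.xml.bz2"  # assumed dump location

# Stream the compressed dump straight into MediaWiki's importer
# (the step that took about 4 hours for the article text).
bzcat = subprocess.Popen(["bzcat", DUMP], stdout=subprocess.PIPE)
subprocess.run(["php", "maintenance/importDump.php"],
               stdin=bzcat.stdout, cwd=MEDIAWIKI_DIR, check=True)
bzcat.wait()

# Rebuild the ancillary tables and the MySQL full-text search index;
# the search index rebuild is the part that reportedly takes days.
subprocess.run(["php", "maintenance/rebuildrecentchanges.php"],
               cwd=MEDIAWIKI_DIR, check=True)
subprocess.run(["php", "maintenance/rebuildtextindex.php"],
               cwd=MEDIAWIKI_DIR, check=True)
```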
Wikipedia Selection for Schools is a static website, created by Wikimedia and SOS Children's Villages, with a hand-chosen and checked selection of articles from the main Wikipedia, and images, that fit on a DVD or 3 GB of disk space. It's available for free download using BitTorrent, which is rather slow. Although it looks like Wikipedia, it's a static website, so while it's easy to install, it has no search feature. It also has only 5,500 articles, compared to the 2 million in Wikipedia itself (about 0.25%). Another review is on the Speed of Creativity Blog. Older versions are available here. (thanks BBC)
Zipedia is a Firefox plugin which loads and indexes a Wikipedia dump file. It requires a different dump file, containing the latest metadata (8 GB) instead of the usual one (3 GB). You can then access Wikipedia offline in your browser by going to a URL such as wikipedia://wiki. It does not support images, and the search feature only searches article titles, not their contents. You can pass the indexed data between users as a Zip file to save time and bandwidth, and you may be able to share this file between multiple users on a computer or a network. (thanks Ghacks.net)
WikiTaxi is a free Windows application which also loads and indexes Wikipedia dump files. It has its own user interface, which displays Wikipedia formatting properly (e.g. tables). It looks very nice, but it's a shame that it doesn't run on Linux.
Moulin Wiki is a project to develop open source offline distributions of Wikipedia content, based on the Kiwix browser. They claim that their 150 MB Arabic version contains an impressive 70,000 articles, and that their 1.5 GB French version contains the entire French Wikipedia, more than 700,000 articles. Unfortunately they have not yet released an English version.
Kiwix itself can be used to read a downloaded dump file, thereby giving access to the whole English Wikipedia via the 3 GB download. It runs on Linux only (as far as I know) and the user interface is a customised version of the Firefox browser. Unfortunately I could not get it to build on Ubuntu Hardy due to an incompatible change in Xulrunner. (The Kiwix developers told me that a new version would be released before the end of November 2008, but I haven't been able to test it yet.)
Wikipedia Dump Reader is a KDE application which browses Wikipedia dump files. It generates an index on the first run, which took 5 hours on a 3 GHz P4, and you can't use it until it's finished. It doesn't require extracting or uncompressing the dump file, so it's efficient on disk space, and you can copy or share the index between computers. The display is in plain text, so it looks nothing like Wikipedia, and it includes some odd system codes in the output which could confuse users.
Thanassis Tsiodras has created a set of scripts that extract Wikipedia article titles from the compressed dump, index them, and then parse and display articles through a search engine. It's a clever hack, but the user interface is quite rough, it doesn't always work, it requires about twice the dump file size in additional data, and it was a pain to figure out how to use it and get it working. It looks nothing like Wikipedia, although it is better than the Dump Reader above.
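To give a flavour of the general approach (this is not Tsiodras's code, just a minimal sketch of the same random-access idea, assuming the "multistream" dump and its plain-text title index published on dumps.wikimedia.org have both been downloaded; the filenames are illustrative):

```python
#!/usr/bin/env python3
# Sketch of random access into a compressed dump: look a title up in the
# plain-text index (offset:page_id:title per line), seek to that offset in
# the multistream .bz2 file, decompress just that block, and pull out the
# wikitext. Filenames are assumptions, not part of Tsiodras's scripts.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
INDEX = "enwiki-latest-pages-articles-multistream-index.txt"

def find_offset(title):
    """Return the byte offset of the bz2 block containing `title`, or None."""
    with open(INDEX, encoding="utf-8") as index:
        for line in index:
            offset, _page_id, page_title = line.rstrip("\n").split(":", 2)
            if page_title == title:
                return int(offset)
    return None

def fetch_wikitext(title):
    """Decompress one block (roughly 100 pages) and return the article text."""
    offset = find_offset(title)
    if offset is None:
        return None
    with open(DUMP, "rb") as dump:
        dump.seek(offset)
        decompressor = bz2.BZ2Decompressor()
        xml_block = b""
        while not decompressor.eof:
            chunk = dump.read(64 * 1024)
            if not chunk:
                break
            xml_block += decompressor.decompress(chunk)
    # The block is a bare sequence of <page> elements, so wrap it in a root.
    root = ET.fromstring(b"<pages>" + xml_block + b"</pages>")
    for page in root.iter("page"):
        if page.findtext("title") == title:
            return page.findtext("revision/text")
    return None

if __name__ == "__main__":
    print(fetch_wikitext("Zambia"))
```

The title lookup here is a plain linear scan of the index file; a usable tool would load that index into a database or a sorted structure so that keyword search is fast.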
Pocket Wikipedia is designed for PDAs, but apparently runs on Linux and Windows as well. The interface looks a bit rough, and I haven't tested the keyword search yet. It doesn't say exactly how many articles it contains, but my guess is that it's about 3% of Wikipedia. Unfortunately it's closed source, and as it comes from Romania, I don't trust it enough to run it. (thanks makeuseof.com)
Wikislice allows users to download part of Wikipedia and view it using the free Webaroo client. Unfortunately this client appears only to work on Windows. (thanks makeuseof.com)
Encyclopodia puts the open source encyclopaedia onto an iPod, but I want to use it on Linux.
It appears that if you need search and Linux compatibility, then running a real Wikipedia (MediaWiki) server is probably the best option, despite the time taken.
"as it comes from Romania, I don’t trust it enough to run it" Given the stated goal of your blog, such sweeping generalisations seem a little unnecessary.
Hi anon, Thanks for your comment, but I disagree. I think I'm entitled to generalise about my own opinions (i.e. whether or not I trust something). If I'd said "Software from Romania is not trustworthy", that would be a gross generalisation, but I didn't say that. I have a higher skepticism about software from Africa and Eastern Europe than other places, and the way this software is packaged and presented doesn't do enough to assuage my doubts. Call me paranoid or whatever, but I think I'm entitled to choose what software I trust and don't trust.
That's interesting Chris... what are your concerns?
In talking with Wikimedia's CTO a week or two ago, it became painfully obvious that they don't really care much about anything except the live, English Wikipedia site. Small language sites dying off was met with a shrug as if it's expected and talking about providing a proper offline, fully compiled version was avoided. They love the publicity of these things, but don't really seem to put much effort in to maintaining them. I love the Wikipedia, but I am really annoyed and alarmed at Wikimedia's severe lack of transparency and decidedly logjam approach to their technology.
If Wikipedia truly wants to be by the people, for the people, then it needs to find ways of engaging with the large numbers of people who still have marginal access to the Internet and little bandwidth.
You know, this could be a very interesting project to get involved in. Do you still need this platform? Can you provide some more info? I have a small team of competent Java developers, and I'm sure we can come up with something worthwhile, since this is for a good cause. Oh, and just because I say "team", don't immediately think of a payroll =) Say what is needed and share your ideas; but some of the projects you described are really close to what you want, so wouldn't it be easier to modify those? Anyway, say something. Best regards!
Hi Jpoa, Thanks for the offer, but we did what we needed to do by using Mediawiki and Wikipedia for Schools. If you're looking for an interesting project to volunteer for, you might like to have a look at our pmGraph or Loband projects, or EpiSurveyor. Cheers, Chris.
Hi there! I was able to run WikiTaxi on both Linux and Mac OS X (with Wine and CrossOver respectively); I don't know if that is still of use to you. I'll check the projects you mentioned. Best regards and thanks!
Personally, I use Thanassis Tsiodras's method. You are mistaken in claiming that it needs twice the space: it does not. After you split the Wikipedia .bz2 file, you can remove the original (and easily reconstruct it if you ever want to, by simply cat-ing the pieces together). It is also one of the only two methods (MediaWiki being the other one) that correctly shows mathematical formulae, something which was very important to me.
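For anyone following along, the reconstruction step really is just byte-for-byte concatenation, e.g. a single `cat` in the shell. A minimal sketch in Python, assuming piece names like those produced by `split` (the filenames are assumptions):

```python
#!/usr/bin/env python3
# Minimal sketch of the reconstruction step described in the comment above:
# concatenating the split pieces byte-for-byte restores the original .bz2 dump.
# The piece naming pattern is an assumption.
import glob
import shutil

with open("enwiki-pages-articles.xml.bz2", "wb") as restored:
    for piece in sorted(glob.glob("enwiki-pages-articles.xml.bz2.part*")):
        with open(piece, "rb") as part:
            shutil.copyfileobj(part, restored)
```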
Looks like someone has developed an HTML5 offline reader, which you can find here: http://offline-wiki.googlecode.com/git/app.html