Content indexing in Django using Apache Tika

By Chris Wilson on 01 February 2012

For the Documents module of our new open-source Generic Intranet, we need to be able to extract the text content and metadata from various kinds of documents:

PDF files
Microsoft Office DOC, XLS and PPT files
and the new XML equivalents, DOCX, XLSX and PPTX.

I found various tools online to help extract this text, largely thanks to a couple of Stack Overflow answers. I ended up with a hodgepodge of tools:

PDF Miner for PDF files
python-docx for DOCX files
DocToText for PPTX, XLSX and PPT files
antiword for DOC files

There were a number of problems with this hodgepodge:

I was unable to find any Python or command-line solution for old Excel (XLS) files;
These solutions did not extract metadata, only document text;
The choice of which tool to use depends on the MIME type returned by the file(1) command, which varies depending on the OS (Debian/Ubuntu or CentOS) and which version of the library is installed.
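To illustrate the last problem, here is a minimal sketch of the kind of MIME-type dispatch that the hodgepodge required. The function names and the table are hypothetical, and the MIME strings are only examples of what file(1) might report; the values it actually returns are exactly what varied between our systems.

```python
import subprocess

# Hypothetical table mapping MIME types to the tools listed above.
# The keys are illustrative: file(1) may report different strings
# depending on the OS and the installed version of the library.
EXTRACTORS = {
    "application/pdf": "pdfminer",
    "application/msword": "antiword",
    "application/vnd.ms-powerpoint": "doctotext",
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": "python-docx",
    "application/vnd.openxmlformats-officedocument"
    ".presentationml.presentation": "doctotext",
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet": "doctotext",
}


def detect_mime_type(path):
    """Ask file(1) for the MIME type of the file at `path`."""
    output = subprocess.check_output(
        ["file", "--brief", "--mime-type", path])
    return output.decode("ascii", "replace").strip()


def choose_extractor(mime_type):
    """Return the name of the tool to use, or raise ValueError."""
    try:
        return EXTRACTORS[mime_type]
    except KeyError:
        raise ValueError("no extractor for MIME type %r" % mime_type)
```

Any MIME string that is missing from the table, such as a generic type reported for a document by a different version of file(1), falls through to the ValueError, which is why this approach was fragile.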

Another Stack Overflow post recommended Apache Tika for metadata extraction. It appears to support all the document formats that we need, and to have auto-detection of the document format, which solves all the MIME type problems as well. However, it introduces a new problem: it's written in Java, which is hard to access from Python.

Luckily I found some instructions for building a Python wrapper around Tika, using some tools that I'd never heard of, and this seemed like a good approach. Unfortunately the installation process is very non-standard, which would not fit in with our Fabric-based automated deployment process, and would make it harder for users to install the Intranet themselves.

The instructions are somewhat outdated at the time of writing, as they refer to Tika version 0.7, while 1.0 has been released. I was unable to register for an account to update that page, so I wrote to the author with the details that I discovered, and will also document here that the following command works for me:

python ../jcc/jcc/__main__.py \
        --include /usr/share/java/org.eclipse.osgi.jar \
        --jar tika-parsers-1.0.jar \
        --jar tika-core-1.0.jar \
        java.io.File java.io.FileInputStream \
        java.io.StringBufferInputStream \
        --package org.xml.sax \
        --include tika-app-1.0.jar \
        --python tika --version 1.0 --reserved asm

I was able to go further than this, and package Tika in a way that makes it easy to install with pip, and thus integrate with our deployment process.

The wrapper is written using JCC, which works by generating and compiling C++ code that links to the Java classes, and then a Python wrapper around that C++. This means that it needs to be recompiled for each platform, so I couldn't just distribute a binary blob with the Intranet (I had the same problem with DocToText above).

The version of setuptools on our servers doesn't support JCC's shared library mode. JCC dies with an error unless shared mode is explicitly disabled or its setuptools patch has been applied. I couldn't do either of these as part of our standard deployment process, so I patched JCC to disable shared mode, since we don't need it anyway. I also added some patches to allow various setup.py commands used by pip to be forwarded through JCC to the setup function call.

This seems to be enough to allow you to install JCC like this:

pip install git+git://github.com/aptivate/jcc.git

I also wrote a setup.py file that handles pip's command-line invocations and passes the necessary options to JCC, and JCC's invocation of the setup function. This seems to be enough to install the package using pip:

pip install git+git://github.com/aptivate/python-tika.git

and you can use the last parameter as a package specification in pip_packages.txt, or whatever you pass to pip -r.
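As a rough idea of what "passing the necessary options to JCC" involves, the sketch below assembles the same argument list as the command shown earlier. It is not the setup.py from the repository; the function name and its parameters are hypothetical, and the paths are simply the ones from my own build.

```python
import sys


def jcc_args(tika_version="1.0", jcc_main="../jcc/jcc/__main__.py"):
    """Build the argument list for the JCC command shown earlier.

    Hypothetical helper: only the option values come from the
    command line that worked for me.
    """
    args = [sys.executable, jcc_main,
            "--include", "/usr/share/java/org.eclipse.osgi.jar"]
    # Wrap every public class in the Tika parsers and core JARs.
    for name in ("tika-parsers", "tika-core"):
        args += ["--jar", "%s-%s.jar" % (name, tika_version)]
    args += [
        # Extra Java classes that the Python code needs to reach.
        "java.io.File", "java.io.FileInputStream",
        "java.io.StringBufferInputStream",
        "--package", "org.xml.sax",
        "--include", "tika-app-%s.jar" % tika_version,
        "--python", "tika", "--version", tika_version,
        "--reserved", "asm",
    ]
    return args
```

The resulting list could be passed to subprocess.check_call from a build script, which keeps the Tika version number in one place.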

You can find the pip-installable Tika package, complete with Tika 1.0 JAR files, in our python-tika repository on GitHub. This will save you the work of downloading and compiling Tika and all of its dependencies. I have started a discussion with the JCC developers about merging these changes into the upstream project.
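Once the package is installed, extracting text and metadata might look roughly like the sketch below. It assumes that the JCC-generated module exposes Tika's Java classes (AutoDetectParser, BodyContentHandler, Metadata, ParseContext) and the java.io classes named in the build command as top-level attributes, which is how JCC normally lays out a wrapper; the helper function itself is hypothetical. The module is passed in as a parameter so that the JVM is started by the caller.

```python
def extract_text_and_metadata(tika, path):
    """Return (text, metadata dict) for the document at `path`.

    `tika` is the imported JCC-built module. The caller is assumed
    to have started the JVM with tika.initVM() already.
    """
    stream = tika.FileInputStream(tika.File(path))
    try:
        handler = tika.BodyContentHandler()
        metadata = tika.Metadata()
        # AutoDetectParser works out the document format itself,
        # so no MIME type needs to be supplied.
        parser = tika.AutoDetectParser()
        parser.parse(stream, handler, metadata, tika.ParseContext())
    finally:
        stream.close()
    fields = dict((name, metadata.get(name))
                  for name in metadata.names())
    return handler.toString(), fields


# Assumed usage, not run here because it needs the built module:
#
#   import tika
#   tika.initVM()
#   text, fields = extract_text_and_metadata(tika, "report.pdf")
```

Depending on the JCC version, initVM may need to be given the classpath explicitly, for example tika.initVM(classpath=tika.CLASSPATH).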

