Content indexing in Django using Apache Tika
For the Documents module of our new open-source Generic Intranet, we need to be able to extract the text content and metadata from various kinds of documents:
- PDF files
- Microsoft Office DOC, XLS and PPT files
- and the new XML equivalents, DOCX, XLSX and PPTX.
I first tried a separate tool for each format:
- PDF Miner for PDF files
- python-docx for DOCX files
- DocToText for PPTX, XLSX and PPT files
- antiword for DOC files
This approach had several problems:
- I was unable to find any Python or command-line solution for old Excel (XLS) files;
- These solutions did not extract metadata, only document text;
- The choice of which tool to use depends on the MIME type returned by the file(1) command, which varies depending on the OS (Debian/Ubuntu or CentOS) and on which version of the library is installed.
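The last problem is easier to see in code. This is a minimal sketch of that kind of MIME-type dispatch, not the code we actually used: the `EXTRACTORS` table and the two function names are hypothetical, and the MIME strings are the standard registered ones, which a given version of file(1) may or may not report.

```python
import subprocess

# Hypothetical table: MIME type -> name of the extraction tool listed above.
# There is deliberately no entry for old Excel (XLS) files.
EXTRACTORS = {
    "application/pdf": "pdfminer",
    "application/msword": "antiword",
    "application/vnd.ms-powerpoint": "doctotext",
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": "python-docx",
    "application/vnd.openxmlformats-officedocument"
    ".presentationml.presentation": "doctotext",
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet": "doctotext",
}


def detect_mime_type(path):
    """Ask file(1) for the MIME type of the file at path."""
    output = subprocess.check_output(
        ["file", "--brief", "--mime-type", path])
    return output.decode("ascii", "replace").strip()


def choose_extractor(mime_type):
    """Return the tool name for a MIME type, or None if nothing handles it."""
    return EXTRACTORS.get(mime_type)
```

The table is only as reliable as the strings that file(1) returns. If one server reports a DOCX file under a different type (some versions may report the enclosing ZIP container, for example), the lookup silently returns None and the document is not indexed.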
Luckily I found some instructions for building a Python wrapper around Apache Tika, a Java toolkit that extracts text and metadata from many document formats, using some tools that I'd never heard of, and this seemed like a good approach. Unfortunately the installation process is very non-standard: it would not fit in with our Fabric-based automated deployment process, and would make it harder for users to install the Intranet themselves.
The instructions are somewhat outdated at the time of writing, as they refer to Tika version 0.7, while 1.0 has been released. I was unable to register for an account to update that page, so I wrote to the author with the details that I discovered, and will also document here that the following command works for me:
    python ../jcc/jcc/__main__.py \
        --include /usr/share/java/org.eclipse.osgi.jar \
        --jar tika-parsers-1.0.jar \
        --jar tika-core-1.0.jar \
        java.io.File java.io.FileInputStream \
        java.io.StringBufferInputStream \
        --package org.xml.sax \
        --include tika-app-1.0.jar \
        --python tika --version 1.0 --reserved asm

I was able to go further than this, and package Tika in a way that makes it easy to install with pip, and thus integrate with our deployment process.
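For anyone scripting the build, the JCC command above can be assembled as an argument list. This is only a sketch: `build_command` and its parameters are hypothetical names, not part of JCC or python-tika, and it assumes that other Tika releases use the same JAR naming pattern as 1.0, which I have not checked.

```python
def build_command(version="1.0", python="python",
                  jcc_main="../jcc/jcc/__main__.py"):
    """Return the JCC invocation shown above as an argument list.

    Only the options from that command are used; nothing is added.
    The version parameter is an assumption about JAR file naming.
    """
    return [
        python, jcc_main,
        "--include", "/usr/share/java/org.eclipse.osgi.jar",
        "--jar", "tika-parsers-%s.jar" % version,
        "--jar", "tika-core-%s.jar" % version,
        "java.io.File", "java.io.FileInputStream",
        "java.io.StringBufferInputStream",
        "--package", "org.xml.sax",
        "--include", "tika-app-%s.jar" % version,
        "--python", "tika",
        "--version", version,
        "--reserved", "asm",
    ]

# To run it from the directory containing the Tika JAR files:
#     import subprocess
#     subprocess.check_call(build_command())
```

Passing a list to subprocess avoids shell quoting problems and keeps the version number in one place.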
The wrapper is written using JCC, which works by generating and compiling C++ code that links to the Java classes, and then a Python wrapper around that C++. This means that it needs to be recompiled for each platform, so I couldn't just distribute a binary blob with the Intranet (I had the same problem with DocToText above).
The version of setuptools on our servers doesn't support JCC's shared library mode. JCC dies with an error unless shared mode is explicitly disabled or setuptools is patched. I couldn't do either of these as part of our standard deployment process, so I patched JCC to disable shared mode, since we don't need it anyway. I also added some patches to allow the various setup.py commands used by pip to be forwarded through JCC to the setup function call.
This seems to be enough to allow you to install JCC like this:
    pip install git+git://github.com/aptivate/jcc.git

I also wrote a setup.py file that handles pip's command-line invocations and passes the necessary options to JCC, and JCC's invocation of the setup function. This seems to be enough to install the package using pip:
    pip install git+git://github.com/aptivate/python-tika.git

You can also use the last parameter as a package specification in pip_packages.txt, or whatever file you pass to pip install -r.
You can find the pip-installable Tika package, complete with the Tika 1.0 JAR files, in our python-tika repository on GitHub. This will save you the work of downloading and compiling Tika and all of its dependencies. I have started a discussion with the JCC developers about merging these changes into the upstream project.