Archive for the ‘Django’ Category

Content indexing in Django using Apache Tika

Wednesday, February 1st, 2012

For the Documents module of our new open-source Generic Intranet, we need to be able to extract the text content and metadata from various kinds of documents:

  • PDF files
  • Microsoft Office DOC, XLS and PPT files
  • and the new XML equivalents, DOCX, XLSX and PPTX.

I found various tools online to help extract this text, largely thanks to two Stack Overflow answers. I ended up with a hodgepodge of tools.

There were a number of problems with this hodgepodge:

  • I was unable to find any Python or command-line solution for old Excel (XLS) files;
  • These solutions did not extract metadata, only document text;
  • The choice of which tool to use depends on the MIME type returned by the file(1) command, which varies depending on the OS (Debian/Ubuntu or CentOS) and on which version of the library is installed.
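To illustrate the last problem, here is a minimal sketch of the dispatch-by-MIME-type pattern that a set of per-format tools forces on you. It is not our actual code: the extractor functions are hypothetical stand-ins for external tools, and it guesses the type from the file name with Python's standard mimetypes module rather than calling file(1).

```python
import mimetypes

# Hypothetical extractors: stand-ins for whichever external tool
# handles each format. They do no real extraction.
def extract_pdf(path):
    return "text from PDF: " + path

def extract_word(path):
    return "text from DOC: " + path

# One entry per MIME type; every new format needs another tool here.
EXTRACTORS = {
    "application/pdf": extract_pdf,
    "application/msword": extract_word,
}

def extract_text(path):
    """Pick an extractor from the guessed MIME type, or fail loudly."""
    mime_type, _encoding = mimetypes.guess_type(path)
    try:
        extractor = EXTRACTORS[mime_type]
    except KeyError:
        raise ValueError("No extractor for %r (MIME type %r)" % (path, mime_type))
    return extractor(path)
```

Any difference in the detected MIME type between systems silently sends a document to the wrong branch, or to no branch at all, which is why a single tool with built-in format detection is attractive.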

Another Stack Overflow post recommended Apache Tika for metadata extraction. It appears to support all the document formats that we need, and to have auto-detection of the document format, which solves all the MIME type problems as well. However, it introduces a new problem: it’s written in Java, which is hard to access from Python.

Luckily I found some instructions for building a Python wrapper around Tika, using some tools that I’d never heard of, and this seemed like a good approach. Unfortunately the installation process is very non-standard, so it would not fit in with our Fabric-based automated deployment process, and it would make it harder for users to install the Intranet themselves.

The instructions are somewhat outdated at the time of writing, as they refer to Tika version 0.7, while 1.0 has been released. I was unable to register for an account to update that page, so I wrote to the author with the details that I discovered, and will also document here that the following command works for me:

python ../jcc/jcc/__main__.py \
        --include /usr/share/java/org.eclipse.osgi.jar \
        --jar tika-parsers-1.0.jar \
        --jar tika-core-1.0.jar \
        java.io.File java.io.FileInputStream \
        java.io.StringBufferInputStream \
        --package org.xml.sax \
        --include tika-app-1.0.jar \
        --python tika --version 1.0 --reserved asm

I was able to go further than this, and package Tika in a way that makes it easy to install with Pip, and thus integrate with our deployment process.

The wrapper is written using JCC, which works by generating and compiling C++ code that links to the Java classes, and then a Python wrapper around that C++. This means that it needs to be recompiled for each platform, so I couldn’t just distribute a binary blob with the Intranet (I had the same problem with DocToText above).

The version of setuptools on our servers doesn’t support JCC’s shared library mode. JCC dies with an error unless shared mode is explicitly disabled or setuptools is patched. I couldn’t do either of these as part of our standard deployment process, so I patched JCC to disable shared mode, since we don’t need it anyway. I also added some patches to allow various setup.py commands used by pip to be forwarded through JCC to the setup function call.

This seems to be enough to allow you to install JCC like this:

pip install git+git://github.com/aptivate/jcc.git

I also wrote a setup.py file that handles pip’s command-line invocations, passes the necessary options to JCC, and handles JCC’s invocation of the setup function. This seems to be enough to install the package using pip:

pip install git+git://github.com/aptivate/python-tika.git

and you can use the last parameter as a package specification in pip_packages.txt, or whatever you pass to pip -r.

You can find the pip-installable Tika package, complete with Tika 1.0 JAR files, in our python-tika repository on GitHub. This will save you the work of downloading and compiling Tika and all of its dependencies. I have started a discussion with the JCC developers about merging these changes into the upstream project.

Embedding jinja2 templates in Django templates

Tuesday, November 15th, 2011

We recently integrated the Askbot forum into the Django-based websites we developed for the RIMI4AC Project. Askbot uses the Jinja2 templating language but this was incompatible with the standard Django templates we had used up to this point. Here’s how we solved the problem.

When we were asked to recommend a forum to be integrated into the suite of websites we were developing for the RIMI4AC Project, Askbot was the clear favourite, due to its large feature set, ease of customisation, active development team and wide user base. The only drawback was that the templating engine used by Askbot was Jinja2, which would make it difficult for Askbot to be embedded into the websites. Up until then these websites had been developed using standard Django templates.

We came up with the following options to solve this problem:

Create an Askbot skin using Jinja2, which would mimic our existing templates

This would be easy to implement but would incur high maintenance costs, as any changes to the standard Django templates would also need to be made to the Jinja2 templates. The copying could possibly be scripted to make it easier.

Embed Askbot in an iframe

Again, this would be simple to implement, but iframes introduce problems of their own with navigation and rendering.

Rewrite Askbot to use Django templates

This would be a lot of work and as we would effectively be forking Askbot we would incur the costs of maintaining our own version.

Rewrite the rest of our websites to use Jinja2 templates

This would be quite a bit of work and any new components we wanted to integrate into our websites would also need to use Jinja2.

Choose a different forum that used Django templates

We really didn’t want to do this as we had had good reasons for choosing Askbot.

Some way of rendering Django templates from within Jinja2

Although Jinja2 supports extensions and we could possibly have written one to render Django templates, this seemed to be the opposite of what we wanted – to embed Askbot as a component of our website and not the other way around.

Some way of rendering Jinja2 templates from within Django templates

This looked like the most promising solution. Fortunately the Askbot developers were good enough to name their views, which meant that we could provide our own urls.py with views of the same name and then any reverse() lookups within Askbot would just work. Any view that rendered into a Jinja2 template by calling the function render_into_skin() would be replaced with this wrapper function:

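from django.shortcuts import render_to_response
from django.template import RequestContext

def render_jinja2_into_django_template(request, jinja2_view, *args, **kwargs):
    # Call the original Askbot view, which renders its Jinja2 template
    response = jinja2_view(request, *args, **kwargs)

    # Pass the rendered HTML to our own Django container template
    return render_to_response("forum_container.html",
        {'forum_content': response.content},
        context_instance=RequestContext(request))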

The wrapper function would call the original Jinja2 view function and we could pull the raw content out of the returned response object. This would go into the forum_content variable that would be passed to our own Django template and simply written out from within there.
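The wrapping pattern itself does not depend on Django, so it can be sketched in isolation. In the sketch below everything is a stand-in: FakeResponse takes the place of Django's response object, a string.Template takes the place of forum_container.html, and questions is a hypothetical Askbot-style view.

```python
from string import Template

class FakeResponse(object):
    """Stand-in for a Django response: only .content matters here."""
    def __init__(self, content):
        self.content = content

# Stand-in for forum_container.html; the real one is a Django template.
CONTAINER = Template("<div id='site'>$forum_content</div>")

def wrap_in_container(inner_view):
    """Return a view that renders inner_view's output inside the container."""
    def wrapped(request, *args, **kwargs):
        # Let the inner view render its own template to a finished fragment
        response = inner_view(request, *args, **kwargs)
        # Drop that fragment, unmodified, into the outer page
        page = CONTAINER.substitute(forum_content=response.content)
        return FakeResponse(page)
    return wrapped

def questions(request):
    # Hypothetical forum view that returns already-rendered HTML
    return FakeResponse("<h1>Questions</h1>")

# The outer site exposes the wrapped view under the same name
site_questions = wrap_in_container(questions)
```

Note that in a real Django template the forum_content variable holds finished HTML, so it has to be written out unescaped (for example with the safe filter); otherwise Django's autoescaping would display the markup as text.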

Because we wanted Askbot to appear as a component within a page rather than its own standalone application, we would also need to remove the headers and footers from the Askbot Jinja2 templates. Askbot’s skin customisation made this straightforward. The Django container template would need to include any stylesheets or scripts that we had removed from the headers.

With some fixes to the CSS we had successfully embedded Askbot in our website. There would be some maintenance costs in that any future changes to the Askbot views would need to be reflected in our own Askbot views, but this would be far preferable to having to maintain our own set of templates.

Checking missing translations automatically

Tuesday, July 26th, 2011

For our open source openconsent project, which uses the Django framework, we have recently added internationalisation support. Here’s how we’re testing it.

Before any translations are in place, it’s difficult to ensure that all text is appropriately tagged for translation, either with {% trans %} tags in templates or using gettext() and its friends in the code. Checking missing translations by eye is time-consuming and prone to error.

Inspired by the article “Mocking gettext with Django Translations to test that your code is translating” by Rory McCann, we wrote an automated test to do this:

# coding: utf-8

from publicweb.tests.open_consent_test_case import OpenConsentTestCase
from django.core.urlresolvers import reverse
from django.utils import translation
from lxml.html.soupparser import fromstring
from lxml.cssselect import CSSSelector

class InternationalisationTest(OpenConsentTestCase):

    def setUp(self):
        self.login()

    def test_all_text_translated_when_viewing_decision_list(self):
        self.check_all_text_translated('decision_list')

    def test_all_text_translated_when_adding_decision(self):
        self.check_all_text_translated('decision_add')

    def check_all_text_translated(self, view):
        self.mock_get_text_functions_for_french()

        translation.activate("fr")

        response = self.client.get(reverse(view), follow=True)
        html = response.content

        root = fromstring(html)
        sel = CSSSelector('*')

        for element in sel(root):
            if self.has_translatable_text(element):
                self.assertTrue(self.contains(element.text, "XXX "),
                                "No translation for element " + \
                                str(element) + " with text '" + \
                                element.text + \
                                "' from view '" + view + "'")

    def has_translatable_text(self, element):
        if element.text is None or element.text.strip() == "" \
            or "not_translated" in element.attrib.get('class', '').split(" ") \
            or element.tag == 'script' \
            or element.text.isdigit():
            return False
        else:
            return True

    def contains(self, string_to_search, sub_string):
        return string_to_search.find(sub_string) > -1

    def mock_get_text_functions_for_french(self):
        # A decorator function that just adds 'XXX ' to the front of all
        # strings
        def wrap_with_xxx(func):
            def new_func(*args, **kwargs):
                output = func(*args, **kwargs)
                return "XXX "+output
            return new_func

        old_lang = translation.get_language()
        # Activate french, so that if the fr files haven't
        # been loaded, they will be loaded now.
        translation.activate("fr")

        french_translation = translation.trans_real._active.value

        # wrap the ugettext and ungettext functions so that 'XXX '
        # will prefix each translation
        french_translation.ugettext = \
            wrap_with_xxx(french_translation.ugettext)
        french_translation.ungettext = \
            wrap_with_xxx(french_translation.ungettext)

        # Turn back on our old translations
        translation.activate(old_lang)
        del old_lang

We mock the French ugettext() and ungettext() to prefix any translated strings with XXX. Our automated tests now just need to ensure that any text on the page carries the XXX prefix.
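The same wrapping trick can be tried outside Django. This is a minimal sketch, assuming the standard library's gettext.NullTranslations as a stand-in for Django's active French translation object; the catalog and the message strings are illustrative only.

```python
import gettext

def wrap_with_xxx(func):
    """Decorator that prefixes whatever func returns with 'XXX '."""
    def new_func(*args, **kwargs):
        return "XXX " + func(*args, **kwargs)
    return new_func

# NullTranslations returns each message unchanged, which is enough to
# show the effect of the wrapper.
catalog = gettext.NullTranslations()
catalog.gettext = wrap_with_xxx(catalog.gettext)
catalog.ngettext = wrap_with_xxx(catalog.ngettext)

_ = catalog.gettext

tagged = _("Add decision")   # went through the translation function
untagged = "Add decision"    # a string someone forgot to tag
```

Only strings that pass through the wrapped functions pick up the marker, so any text without it is a string that was never tagged for translation.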

There are two tests in this class, one for each page that we want to check. Both call the method check_all_text_translated(), which sends a GET request for the given view. We use lxml to parse the response. The CSS selector ‘*’ returns every element.

Because our database is empty when running these tests, we can be sure that pretty much all of the text nodes should be translated. There are a number of exceptions that we filter out in the method has_translatable_text():

  • White space
  • JavaScript
  • Numbers
  • Anything in a tag with class “not_translated”

The last category is a bit of a hack as it isn’t really used other than in our tests. We couldn’t think of a way around this. There are only a couple of places where we need to do this, for example when displaying the user name of the logged in user.

If none of these exceptions applies and the text does not begin with XXX, we ensure our test fails with plenty of information to track down the missing translation.
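The filtering rules can be exercised on their own. The sketch below is an approximation rather than our test code: it assumes well-formed markup so that the standard library's xml.etree.ElementTree can stand in for lxml, and the sample page and the untranslated_elements() helper are hypothetical.

```python
import xml.etree.ElementTree as ET

def has_translatable_text(element):
    """Apply the exceptions: white space, opt-out class, scripts, numbers."""
    text = element.text
    if text is None or text.strip() == "":
        return False
    if "not_translated" in element.attrib.get("class", "").split():
        return False
    if element.tag == "script" or text.isdigit():
        return False
    return True

def untranslated_elements(markup, marker="XXX "):
    """Return the tags of elements whose text lacks the marker."""
    root = ET.fromstring(markup)
    return [el.tag for el in root.iter()
            if has_translatable_text(el) and marker not in el.text]

# Hypothetical page: one translated heading, one forgotten paragraph,
# and three elements covered by the exceptions.
PAGE = """<html><body>
<h1>XXX Decisions</h1>
<p>Add decision</p>
<span class="user not_translated">alice</span>
<td>42</td>
<script>var x = 1;</script>
</body></html>"""

missing = untranslated_elements(PAGE)
```

Here only the paragraph is reported, since the heading carries the marker and the other elements fall under one of the exceptions.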