Unit Testing Django with Haystack and Solr

By Chris Wilson on 14 August 2013

The Haystack project adds a powerful free-text search capability to your Django applications, so that users or the public can search your site for pages matching particular keywords and criteria. It provides a standard interface to several different search backends, allowing you to switch easily, and is used by many existing projects such as Django-CMS Search and the Aptivate Intranet.

But we do test-driven development, and testing a search index (which is basically a separate database) is hard because:

  • Test isolation is essential, because you want your tests to run on a carefully controlled database so that your tests have predictable and repeatable results;
  • Django provides a built-in mechanism for creating and resetting a test SQL database, safely isolated from your real database, so you don't have to think about how it works;
  • Django provides no such mechanism for search engines, because it's not a built-in feature of Django;
  • Haystack doesn't provide such a feature either;
  • Haystack 2 at least supports multiple search engines in the same application, but Haystack 1 doesn't;
  • Several search engines are supported by Haystack, but the API only allows putting searchable documents in and out, not creating databases;
  • Many search engines only support one database by default (including Solr), which makes isolation impossible;
  • It's not at all obvious how to configure Solr quickly to support multiple databases.

I had to fix a bug on one of our projects that used Django 1.3, which is not compatible with Haystack 2. The Django-CMS dependencies pulled in Haystack 2 by default, but I couldn't run manage.py:

Traceback (most recent call last):
  File "./manage.py", line 129, in <module>
    execute_manager(settings)
  ...
  File "/home/installuser/projects/website/django/website/.ve/local/lib/python2.7/site-packages/south/management/commands/__init__.py", line 10, in <module>
    import django.template.loaders.app_directories
  File "/home/installuser/projects/website/django/website/.ve/local/lib/python2.7/site-packages/django/template/loaders/app_directories.py", line 23, in <module>
    raise ImproperlyConfigured('ImportError %s: %s' % (app, e.args[0]))
django.core.exceptions.ImproperlyConfigured: ImportError haystack: cannot import name six

It took me a while to realise that it wasn't trying to import the Python library Six itself, but the copy that has been bundled with Django since version 1.4.2, and is therefore not present in Django 1.3.
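
The wording of the error is the clue: "cannot import name six" is what Python reports when a name is missing from a package that does exist, which is consistent with an import of the form `from django.utils import six`. A minimal sketch in plain Python 3 (standard library only; the module and attribute names are deliberately made up, and Python 3 words the messages slightly differently from the Python 2.7 traceback above) shows the two kinds of message side by side:

```python
def import_error_message(statement):
    """Run an import statement and return the text of the ImportError, if any."""
    try:
        exec(statement, {})
    except ImportError as e:
        return str(e)
    return None

# A name that is missing from an existing package: the same kind of failure
# as in the traceback above.
missing_name = import_error_message("from os import no_such_name_here")

# A top-level module that is not installed at all: a different message.
missing_module = import_error_message("import no_such_module_here")

print(missing_name)
print(missing_module)
```

Had the standalone Six library been the missing piece, the message would have been of the second kind.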

I didn't want to upgrade Django to fix the bug, so I had to write tests for the bug in Haystack 1, which doesn't support multiple databases. The project also uses the Solr search engine. I tried to switch it to using Whoosh for testing, as it's quite easy to configure Whoosh to use a different database in tests, but that failed with weird errors that I didn't have time to debug. So I ended up on a long journey of discovery into how to configure Solr (which I didn't know at all) to support multiple databases.

I wrote the test first, and it looked like this:

from django.test import TestCase
from django_dynamic_fixture import G

class ImpactEvaluationTest(TestCase):
    def test_new_unpublished_impact_evaluation_is_not_indexed(self):
        from models import ImpactEvaluation
        ie = G(ImpactEvaluation, published=False, published_date=None,
            ignore_fields=['image', 'round'], debug_mode=True)

        from haystack.query import SearchQuerySet
        from haystack.utils import get_identifier
        all_results = SearchQuerySet().filter(id=get_identifier(ie))
        self.assertSequenceEqual([], list(all_results),
            "Newly added, unpublished ImpactEvaluations should not be indexed")

The important part is that newly created ImpactEvaluation objects should only be added to the search engine (and so become searchable) if they are published, and not otherwise. ImpactEvaluation is an indexed model with a Haystack SearchIndex class. This test was originally failing, because unpublished objects were ending up in the search index and being returned in search results, which caused a crash when rendering the search results page.

In order to test this properly, I wanted to have the test use an empty Solr database. But I didn't want to delete everything in my development database to do it. I didn't actually have anything in my development database, since I'd never installed Solr before, but I didn't want to rely on this and end up destroying another developer's real Solr database without warning. I'm stubborn, and I wanted to do this properly.

There are several ways to create a separate, isolated database using Solr:

  • Enabling Solr Cores, which are a feature of newer Solr versions that behave like separate databases served by the same Jetty/Solr process;
  • Add a separate web app to the existing container: another copy of Solr in the same process;
  • Add a completely separate installation of Jetty and Solr on a different port.

Since I wanted to minimise the amount of setup that needed to be done to run the tests, I chose the first route, to enable multiple Cores, and proceeded with my first ever Solr installation. And then I tried to add cores, but it just didn't work. I was trying to access the URL:

  • http://localhost:8080/solr/admin/cores?action=CREATE&name=coreX&instanceDir=/tmp

which is almost exactly as given in the documentation, but I was getting a 404 error and I didn't know why. There was very little useful information on Google, but eventually I found this page which said in passing:

[solr.xml <cores> element:]

adminPath: This is the relative URL path to access the SolrCore administration pages. For example, a value of /admin/cores means that you can access the CoreAdminHandler with a URL that looks like this: http://localhost:8983/solr/admin/cores. If this attribute is not present, then SolrCore administration will not be possible.

And of course the standard Ubuntu packages that I'd used to install Solr didn't even have a solr.xml file at all, let alone configured as suggested on that page to enable multiple cores. So I created /usr/share/solr/solr.xml with the following contents:

<solr persistent="false">
  <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:}">
  </cores>
</solr>

And then, when I restarted Solr (Jetty) I was able to access the URL above to create a new core.
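
Since a missing adminPath attribute fails silently with a 404, it can be worth checking the file before restarting Solr. The following is a minimal sketch using only Python's standard library; the helper name `core_admin_path` is hypothetical, and the XML is the file shown above:

```python
import xml.etree.ElementTree as ET

SOLR_XML = """\
<solr persistent="false">
  <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:}">
  </cores>
</solr>
"""

def core_admin_path(solr_xml_text):
    """Return the adminPath of the <cores> element, or None if it is absent."""
    root = ET.fromstring(solr_xml_text)
    cores = root.find("cores")
    if cores is None:
        return None
    return cores.get("adminPath")

print(core_admin_path(SOLR_XML))
```

A result of None means that, according to the documentation quoted above, core administration will not be possible.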

Now I wanted to create a new core in my tests. And I didn't want to use httplib if possible, because manipulating URLs and encoding parameters with it is a pain. I said above that "the Haystack API only allows putting searchable documents in and out, not creating databases." Haystack internally uses a project called Pysolr, which includes a class called SolrCoreAdmin that's supposed to help with creating new cores. Unfortunately it had quite a few bugs that stopped it from working properly. Until and unless they are fixed by the author, you can try our fork instead.
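
To illustrate the URL handling that such a helper has to do, here is a minimal sketch that builds the core creation URL from earlier in this post. It is not Pysolr's implementation: the function name `core_admin_url` and its parameters are hypothetical, and it uses the Python 3 standard library (on the Python 2.7 used in this project, `urlencode` lives in `urllib` instead).

```python
from urllib.parse import urlencode

def core_admin_url(solr_url, action, admin_path="admin/cores", **params):
    """Build a CoreAdmin URL such as .../admin/cores?action=CREATE&name=..."""
    # Put the action first, then the remaining parameters in the order given.
    query = [("action", action)] + list(params.items())
    return "%s/%s?%s" % (solr_url.rstrip("/"), admin_path.strip("/"),
                         urlencode(query))

url = core_admin_url("http://localhost:8080/solr", "CREATE",
                     name="coreX", instanceDir="/tmp")
print(url)
```

Note that `urlencode` escapes the slash in the instance directory as `%2F`, which is one of the details that is easy to get wrong when assembling the URL by hand.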

To create a new Core in Solr, you need to specify a directory containing all the configuration files that it needs, such as solrconfig.xml and schema.xml. We don't know where the default configuration will be on the user's system, and we don't necessarily want to use it anyway, because we don't know what the settings are. So I created a copy of the standard configuration on my system, and placed it in my Django project in fixtures/solr_test/conf, containing the following files:

admin-extra.html
elevate.xml
mapping-ISOLatin1Accent.txt
protwords.txt
scripts.conf
solrconfig.xml
spellings.txt
stopwords.txt
synonyms.txt

Note that the usual schema.xml file is not included, as we will generate it automatically during the test. I did something similar for the tests in Pysolr to make them pass too.
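
Because Solr's errors for an incomplete configuration directory can be hard to interpret, a quick check of the fixture directory before creating the core may save some time. This is a minimal sketch, assuming the file list above is what your Solr version needs; the helper name `missing_conf_files` is hypothetical:

```python
import os

# The files listed above; schema.xml is left out because it is generated.
EXPECTED_CONF_FILES = [
    "admin-extra.html", "elevate.xml", "mapping-ISOLatin1Accent.txt",
    "protwords.txt", "scripts.conf", "solrconfig.xml", "spellings.txt",
    "stopwords.txt", "synonyms.txt",
]

def missing_conf_files(conf_dir, expected=EXPECTED_CONF_FILES):
    """Return the expected file names that are absent from conf_dir."""
    present = set(os.listdir(conf_dir))
    return [name for name in expected if name not in present]
```

An empty list means that every expected file is present in the directory.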

Then I created a base class for my tests to inherit from, instead of the default Django TestCase (I could have made it a mixin instead):

import pysolr
from django.conf import settings
from django.test import TestCase

class SolrTestCase(TestCase):
    longMessage = True
    solrCoreAdminPath = 'admin/cores'

    def setUp(self):
        self.old_solr_url = settings.HAYSTACK_SOLR_URL

        from haystack.management.commands.build_solr_schema \
            import Command as BuildSolrSchemaCommand
        cmd = BuildSolrSchemaCommand()

        import os
        instance_dir = os.path.join(settings.PROJECT_PATH, 'fixtures', 'solr_test')
        conf_dir = os.path.join(instance_dir, 'conf')
        cmd.handle(filename=os.path.join(conf_dir, "schema.xml"))

        from pysolr import SolrError
        try:
            admin = pysolr.SolrCoreAdmin(url='%s/%s' % 
                (settings.HAYSTACK_SOLR_URL, self.solrCoreAdminPath))
            admin.create(name='test_core', instance_dir=instance_dir)
        except SolrError as e:
            # Re-raise with a hint about the most likely cause.
            raise SolrError("Unable to create a new disposable core. Have "
                "you configured Solr to enable multiple cores as described "
                "at http://docs.lucidworks.com/display/solr/Core+Admin+and+Configuring+solr.xml? "
                "%s" % e)

        settings.HAYSTACK_SOLR_URL += '/test_core'
        self.solr = pysolr.Solr(settings.HAYSTACK_SOLR_URL)
        self.solr.delete(q='*:*')

        # poke into haystack backends to change URL of any backend that's
        # already registered.
        from haystack.sites import site
        from haystack.backends.solr_backend import SearchBackend as SolrSearchBackend
        for index in site.get_indexes().values():
            if isinstance(index.backend, SolrSearchBackend) and not getattr(index.backend, '_old_conn', None):
                index.backend._old_conn = index.backend.conn
                index.backend.conn = self.solr

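    def tearDown(self):
        settings.HAYSTACK_SOLR_URL = self.old_solr_url
        from haystack.sites import site
        from haystack.backends.solr_backend import SearchBackend as SolrSearchBackend
        for index in site.get_indexes().values():
            # Only restore backends that setUp actually swapped, and clear the
            # marker so that the next test's setUp swaps them again.
            if isinstance(index.backend, SolrSearchBackend) and getattr(index.backend, '_old_conn', None):
                index.backend.conn = index.backend._old_conn
                index.backend._old_conn = None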

When setting up the test, this class generates a new schema.xml containing the current configuration generated by Haystack's build_solr_schema management command, and writes it into the fixtures/solr_test/conf directory. Then it tells Solr to create a new core using the contents of that directory. For newer Solr versions, I would need to UNLOAD any existing core first to delete its contents, but I couldn't test that because the version of Solr that I'm using doesn't require it.
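
Since setUp goes on to delete every document at the new URL, a cheap guard against pointing it at a real index may be worthwhile. The following is a minimal sketch, not part of the test class above; the helper names `disposable_core_url` and `assert_disposable` are hypothetical, and `test_core` is the core name used in setUp:

```python
def disposable_core_url(base_url, core_name="test_core"):
    """Append the disposable core's name to the base Solr URL."""
    return "%s/%s" % (base_url.rstrip("/"), core_name)

def assert_disposable(url, core_name="test_core"):
    """Refuse to continue unless url points at the disposable test core."""
    last_segment = url.rstrip("/").rsplit("/", 1)[-1]
    if last_segment != core_name:
        raise AssertionError("Refusing to wipe non-test Solr index at %s" % url)
    return url

core_url = disposable_core_url("http://localhost:8080/solr")
assert_disposable(core_url)  # passes silently for the test core
```

Calling `assert_disposable` immediately before a wildcard delete would turn a misconfigured URL into a test failure instead of data loss.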

Then I delete all objects from the core I've just created, in case it already existed. And finally, since Haystack may initialise itself and create its own Pysolr instance, using the old configuration (with the default search index) while the tests are still being prepared, I go through all the Haystack SearchIndex objects that already exist, replacing their connections (Pysolr instances) with the one that I created to do deletion with.
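
The swap-and-restore step can also be expressed as a context manager, which guarantees that the original connections are put back even if the test body raises. This is a minimal sketch of that general pattern using stand-in objects rather than real Haystack backends; `swapped_attr` and `FakeBackend` are hypothetical names:

```python
from contextlib import contextmanager

@contextmanager
def swapped_attr(objects, name, replacement):
    """Temporarily set obj.<name> = replacement on each object, then restore."""
    saved = [(obj, getattr(obj, name)) for obj in objects]
    try:
        for obj, _ in saved:
            setattr(obj, name, replacement)
        yield
    finally:
        # Restore every original value, even if the body raised.
        for obj, original in saved:
            setattr(obj, name, original)

class FakeBackend(object):
    """Stand-in for a search backend; only the conn attribute matters here."""
    def __init__(self, conn):
        self.conn = conn

backends = [FakeBackend("real-1"), FakeBackend("real-2")]
with swapped_attr(backends, "conn", "test-conn"):
    during = [b.conn for b in backends]
after = [b.conn for b in backends]
```

Inside the `with` block every backend uses the test connection; afterwards each one has its own original connection again.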

Finally I had a failing test, and I was able to fix the original bug. I discovered that the ImpactEvaluation SearchIndex defines an index_queryset method which excludes unpublished objects from being indexed, but only when the index is generated from scratch:

def index_queryset(self):
    return ImpactEvaluation.objects.filter(published=True) # Only add already published

However, it's missing the equivalent check for individual objects as they are saved. Unpublished objects are therefore added to the search index when they are created, and only disappear when the index is regenerated from scratch.

In general I don't think that objects excluded by the index_queryset should be added in the first place, so I wrote a generalised mixin that can be added to any SearchIndex class to protect against this:

class NoIndexHiddenObjectsMixin(object):
    def should_update(self, instance, **kwargs):
        """
        Unpublished items should NOT be indexed, because they cause errors
        when trying to retrieve their data in search results.
        """
        try:
            self.index_queryset().get(pk=instance.pk)
            return True
        except instance.DoesNotExist:
            return False

I think there are other cases that should be handled too, such as an object being changed from published to unpublished, which should remove it from the index. But this will do for a start. I think this might be a common developer error when using Haystack, and it might be a useful feature to enable by default.
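
As an illustration of the published-to-unpublished case, here is a minimal sketch of the intended behaviour using a toy in-memory index. It does not use Haystack's API: `ToyIndex`, `on_save` and `is_visible` are hypothetical names, with `is_visible` playing the role of index_queryset.

```python
class ToyIndex(object):
    """In-memory stand-in for a search index, keyed by primary key."""

    def __init__(self, is_visible):
        self.is_visible = is_visible  # plays the role of index_queryset()
        self.documents = {}

    def on_save(self, pk, obj):
        """Index visible objects; drop ones that are no longer visible."""
        if self.is_visible(obj):
            self.documents[pk] = obj
        else:
            self.documents.pop(pk, None)

index = ToyIndex(is_visible=lambda obj: obj["published"])

index.on_save(1, {"published": False})  # new and unpublished: not indexed
after_unpublished_save = dict(index.documents)

index.on_save(1, {"published": True})   # published: indexed
after_publish = sorted(index.documents)

index.on_save(1, {"published": False})  # unpublished again: removed
after_unpublish = sorted(index.documents)
```

The point of the sketch is that a save should be able to remove a document as well as add one, so that the index always matches what index_queryset would return.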

Update: we have submitted a pull request to Haystack which, if accepted, will resolve this for most users automatically.