View low bandwidth version

Archive for the ‘Database’ Category

Rough Guide to rural data collection with ODK

Monday, December 5th, 2011

This post has three purposes, which I think overlap sufficiently to combine them:

  • A User Guide for the system that we developed for UNICEF, IDS and RuralNet Zambia
  • A Developers’ Guide for anyone wishing to build something similar
  • Notes on lessons learned that may assist future implementers

Project goals

Automate the data entry part of a long paper-based survey, by replacing the paper forms with electronic devices.

Hardware and application selection

The survey has several long and complex questions, and long sets of multiple-choice answers. The data collection needs to be done in dusty rural Zambia, and the devices might need to be used for a full day without power. Collected data should be sent wirelessly to a secure data repository at some time after collection.

Text entry is required for many fields. That means either a real keyboard with keys, or a sufficiently large touch screen to type comfortably on. Use of the device camera, and presentation of reports and graphs on the same device, might be required in future.

Two possible hardware platforms were identified:

  • Tablet laptops with touch screens
  • Tablet mobile devices (iPad or Android tablet)

We selected the latter for this project due to lower cost, lighter weight, better usability and longer battery life.

The available software options that we identified were:

  • EpiSurveyor (Java J2ME, partly closed source, we have used before and fixed bugs)
  • OpenXdata (Java J2ME, open source, developed and supported by an Aptivate alumnus among others)
  • Open Data Kit (ODK) (Android, open source, active community)
  • Bespoke online/offline survey in HTML5

Of these, we eliminated EpiSurveyor and OpenXdata due to lack of compatibility with the hardware platform(s) we had chosen.

We chose ODK over a bespoke system due to limited time available for development, and ability to easily take photos and record GPS coordinates using the device’s hardware.

Of the available Android tablet devices, we chose the Samsung Galaxy Tab for the pilot project, due to its high quality construction. For future projects we would probably use a lower cost device; see the lessons learned for details.

Form creation

Since the survey is quite long (about 230 questions) we wanted an easy way to enter the questions. The ODK application requires the form to be in XForms format. We identified the following tools for creating XForms:

We decided to use XLS2XForm, which enabled us to enter the large number of questions easily in Excel. The others all have graphical builders, which have advantages and disadvantages for less technical users:

  • More visually appealing
  • All available options presented visually (types of controls, groups, etc.)
  • Less likely to make a mistake and produce an invalid form
  • Cumbersome user interface slows down data entry

Unfortunately, none of these designers were able to import an existing form in XForms format, which means that the modifiable “source code” of the form must be maintained in a “proprietary” format in each case, and it’s difficult to switch between tools.

You can download the conversion tools, and the Excel spreadsheet with the completed questionnaire as we delivered it to RuralNet, here. RuralNet staff, please use the latest version of the spreadsheet that you can find locally. To use the tools, you will need to download and install Python 2.7 and Java (JRE). Then download the tools as a ZIP file and extract it somewhere. I recommend that you keep the master copy of the spreadsheet in Dropbox to ensure that it’s backed up, and it’s always clear what the latest version is.

For help in building surveys using XLS2XForm, please see the documentation. In addition to the question types listed there, we have used the following shortcuts, which also work in this customised version of XLS2XForm:

  • text is short for add text prompt (a text field, such as a person’s name)
  • note is short for add note prompt (a read-only field, providing additional information for the user)
  • time is a time field without a date (for example, survey start and end times)

To compile the spreadsheet into an XForms form, run the build_and_validate.py script by double-clicking on it. If it works, it will show the message “Success!”, otherwise it will show an error message, usually caused by a mistake in the Excel spreadsheet. If it works, it will create (replace) the file called zambia-ranq-round3.xml in the same directory. If your spreadsheet has a different name, you can create a shortcut to call build_and_validate_custom.py with the name of the spreadsheet on the command line.

Software components

ODK Aggregate is the software that powers the Internet server. It is a repository for blank forms (designs) and completed forms (data). Our server is located at http://partimob.appspot.com/. This server is currently paid for by us, and will need to transfer to RuralNet at some point.

ODK Collect is the application runs on the device, and users interact with it to complete the survey. It’s essentially a user interface for XForms. It can download blank forms (designs) from an ODK Aggregate server, and upload completed forms (data) to the Aggregate server as well.

ODK Briefcase is the software that downloads completed forms (data) from the Aggregate server and convert them into CSV (spreadsheet) format, which can be loaded into

Customised ODK Collect

We are using a custom version of ODK Collect. You can download the source code for it here, or the compiled application here. You can also find it in the ZIP file download. If you prefer, you can use the latest official version of ODK Collect. The two are compatible, but our version adds the following useful features:

  • Use supplied login and password by default to save a round trip and a prompt.
  • Add keyboard navigation, useful for form filling on android-x86 because the mouse interface is pretty clunky.
  • Restore ability to modify completed and submitted forms on the device, which was removed from the official version in 1.1.7.
  • Improved error messages and progress indication during form uploads.
  • Allow setting the instance name on the first page of the survey.
  • Allow saving incomplete surveys on required questions (in case a survey is interrupted; almost all of our questions are required).

There are several ways to install ODK Collect on a device:

  • Download it from the Android Market (official version only, not our customised version)
  • Copy the APK file onto a microSD card, insert the card into the device, and use the My Files application find and open it from the SD card.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the APK file onto the device’s internal memory, then use the My Files application to find and open it.
  • Attach the USB cable from the device to a computer, and use ADB‘s install command to install the APK file.

It’s useful to put the application onto the device’s desktop. To do that, open the Applications list, find ODK Collect, and press and hold it with your finger for a few seconds. The background will change to the desktop; release your finger to drop the application there.

It’s also useful to remove all the other junk from the desktop. For each icon and widget on the desktop, press and hold it with your finger for a few seconds, until the trashcan icon appears, then drag your finger to the trashcan and release it there.

Form management on the device

There are several ways to put blank forms (designs) onto the tablets:

  • Download them from the ODK Aggregate server using ODK Collect.
  • Copy them onto a microSD card, insert the card into the device, and use the My Files application to copy them from the SD card to the /sdcard/odk/forms directory.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the form into the /sdcard/odk/forms directory.
  • Attach the USB cable from the device to a computer, and use ADB or DDMS to push the file onto the device, into the /sdcard/odk/forms directory.

Of these methods, ADB or DDMS is recommended for rapid development, and using the Aggregate server is recommended for production use, since the form must be installed on the Aggregate server for it to be able to accept submissions.

Similarly there are several ways to copy completed forms (data) off the device:

  • Upload them to the ODK Aggregate server using ODK Collect.
  • Use the My Files application to copy them from /sdcard/odk/instances to a microSD card, then remove the card and connect it to the computer, and drop the files into the ODK Briefcase data directory.
  • Attach the USB cable from the device to a computer, enable mass storage mode on the device, and on the computer, drag and drop the files from the /sdcard/odk/instances directory to the ODK Briefcase data directory.
  • Attach the USB cable from the device to a computer, and use ADB or DDMS to pull the file from the device’s /sdcard/odk/instances directory to the ODK Briefcase data directory.

Of these methods, using ODK Aggregate is recommended for development and production use.

Since the Aggregate server is on the Internet, this method requires that the device have Internet access. So it either needs a valid SIM card installed with credit and a data bundle, or a WiFi network connected. We had many problems with using SIM cards for data, so WiFi is preferred if possible.

The directories mentioned above will not exist until ODK Collect is installed on the device and run for the first time. Forms downloaded from the Aggregate server will also be placed in the /sdcard/odk/forms directory. Forms completed on the device will be placed in the /sdcard/odk/instances directory.

Configuring ODK Collect

Collect needs to know the details of the ODK Aggregate server to log into it, download blank forms and upload completed forms.

Open the ODK Collect application, press the Settings button and click on Change Settings. Click on URL and enter https://partimob.appspot.com. Similarly, complete the Username and Password using the details that you’ve been given by the Aggregate server operator, or the account that you’ve created on the Aggregate server. This account should only have Data Collector permissions, no more. Press the Back key to get back to the main menu of ODK Collect.

Downloading forms using ODK Collect

Open ODK Collect on the device, and click on the Get Blank Form button. Collect will try to log into the Aggregate server using the details that you’ve provided, and get a list of forms on the server that have the Downloadable box ticked. This is on by default for newly uploaded forms.

Tick the box next to all the forms that you want to download, and click on the Get Selected button.

Filling forms on the device

Open ODK Collect on the device, and click on the Fill Blank Form button. All the forms in the device’s /sdcard/odk/forms directory should be listed. Choose the form that you want to complete.

You will see an introductory screen showing how to move between questions by swiping your finger across the screen, from right to left or left to right. This screen has a text box at the bottom, which you can use to name the form that you’re completing. Naming forms is useful if your data collection is interrupted and you need to resume it later. It’s much easier to identify the form using its name, rather than opening it and flicking through to find some identifying information. You might name the form based on the household code that you’re surveying.

Depending on your answers to some questions, others may be hidden, or their text might change.

At the end of the form there is another chance to Name this form, and a tickbox to Mark form as finalized. Before you can upload the form to the Aggregate server, this box must be ticked, and you must press the Save Form and Exit button. Otherwise Collect will consider that the form is incomplete.

Sending completed forms to Aggregate

Open ODK Collect on the device, and click on the Send Finalized Form button on the main menu. Tick the box next to all the forms that you want to upload to Aggregate, and click on Send Selected. After the upload is complete, you should see the Upload Results message. Every form should have “Success” next to it, otherwise it was not sent successfully.

Downloading forms using Briefcase

We are using a customised version of ODK Briefcase with the following changes:

  • Fix the export of repeated groups, which before only worked for the first row (issue 461).
  • Shorten exported column names, to allow the CSV file to be imported into Access.
  • Allow the server name, username and password to be provided on the command line (or via a shortcut).

You can find the source code here and the pre-compiled version here, as an executable JAR file. You can also find it in the ZIP file download. If you make changes to the source and want to build the executable JAR again, install Maven and use the mvn package command.

To download the completed forms, open Briefcase by double-clicking on the briefcase-1.0-jar-with-dependencies.jar file. On the Transfer tab, click on the Connect button. For the URL, enter https://partimob.appspot.com, and for the user name and password, give the details of an ODK Aggregate account with Data Viewer permissions.

Then you should see a list of forms appear under the heading Forms to Transfer. Tick the box next to the one that your users have been completing, and then click on the Transfer button. If you do this after all the completed forms (data) have been submitted to the ODK Aggregate server, you will not need to do it again for that form template (design).

Now switch to the Transform tab and see if the form appears in the Form list. If it doesn’t, then exit and restart the Briefcase application (issue 464).

For Output Type, choose .csv and media files. For Output Directory, choose the directory where you’d like to save the CSV files. Note that any previous files exported to that directory from the same form will be overwritten without warning, even if they have been modified (cleaned). Click on the Output button to write the CSV files.

Cleaning data in Excel

You can find the Excel spreadsheet that we use for data storage and cleaning here. Note that Excel is a long way from the best way to store and manipulate data like this. Microsoft Access would be far more appropriate. Yet again I wish there was a sufficiently powerful open source alternative.

Because the spreadsheet contains cleaned data, which is “better” than the raw data which is included in the CSV export, we don’t want to overwrite existing rows. For the main section of the questionnaire (the so-called Single Responses) you can include only the new data like this:

  • Open the main spreadsheet and switch to the Single Responses tab
  • Highlight all rows from 3 down to the bottom, and Sort them by the SubmissionDate column.
  • Note the last submission date on this spreadsheet.
  • Open the newly exported CSV file for the single responses (something like RANQ-2011-Round-4-v5.csv).
  • Sort this file by the SubmissionDate column as well.
  • Highlight and copy all the rows whose submission date is later (more recent) than the last one in the main spreadsheet.
  • Paste them at the bottom of the Single Responses tab of the main spreadsheet, below the other data.

For the other tables, this process needs to be done completely manually at present.

You can then check and clean the data by viewing and modifying it in Excel. Note that each sheet has one or two columns at the end, which are filled by formulae that look up values from the Single Responses sheet, such as the Household Code.

Using the Android x86 Emulator

To be written.

Lessons learned

To be written.

Hibernate, EJB and the @Unique Constraint

Friday, November 26th, 2010

What are Hibernate and EJB

As a bit of background introduction, Hibernate is a Java library that allows Java objects to be loaded and saved from a database. (It is other things as well, but for simplicity I can ignore those for now). It handles loading, creating, updating and searching for objects by generating SQL queries for us.

Hibernate is an implementation of IBM and Sun’s Enterprise Java Beans (EJB) specification. You can argue about which came first, Hibernate or EJB, but Hibernate is a key member of the EJB board and most new EJB-related standards follow Hibernate’s de facto lead, and key Hibernate developers like Emmanuel Bernard are leaders of the EJB specification teams.

Insanity Rules

Let’s start with the theory. I’m going to argue that EJB is insane. I mean it. I’ve been telling people that for nearly a year, and nobody has been able to prove me wrong.

It’s insane because it’s trying to solve the wrong problem, an impossible problem. It’s trying to keep your in-memory Java objects perfectly in sync with the database contents at all times. If you don’t believe me, check out the manual (under Do not treat exceptions as recoverable).

Of course that’s impossible because other people, and other instances of the application, can be modifying the database under your feet, and you have no way to know until you try to save the object, which is when it fails. But the only way to find out is to commit your transaction, and you might not want to do that because you might not be ready to actually save the object yet, or you might want to recover gracefully if it fails (see below).

Instead, EJB forces you to pretend that everything is just rosy, and let it throw an exception when the inevitable happens, and someone modified the record under your feet, or some other constraint is violated (such as uniqueness). The worst thing about this exception is that you can’t recover from it. That’s because the faulty object is still managed by EJB.

If you try to recover from the exception (for example to display a nice message to the user instead of dumping core all over the shop floor), and you touch the database session in any way, you risk that EJB will try to save the object again… which fails again… which throws another exception.

If you discard the session, you’d better not dare touch any object that you loaded from the database before, because it might be lazily loaded (not yet loaded), and throw another exception when you try to actually do anything with it.

There is no way out of this trap, at least officially:

Do not treat exceptions as recoverable: This is more of a necessary practice than a “best” practice. When an exception occurs, roll back the Transaction and close the Session. If you do not do this, Hibernate cannot guarantee that in-memory state accurately represents the persistent state.

We’ve implemented a workaround in RITA, which tries to identify the offending object when an exception occurs and evicts it from the cache, but it’s pretty scary and will never be officially supported by Hibernate.

Performance

The other problem with this approach is that it forces the EJB implementation to constantly check all of your objects to see if any of them have changed, and if so try to persist them to the database. Your application knows which objects have changed and is best placed to handle any errors in trying to save them to the database, but apparently the designers of EJB know better. Or something.

Validation

One of the nicer things about EJB is that it lets you annotate your data-storage classes with extra information that controls how they are saved to the database by EJB. For example, this allows you to specify the table and column names for your classes and properties, as well as information about indexes.

A sub-standard of EJB is Bean Validation (JSR 303), which allows you to write code that checks whether your object is valid before trying to save it to the database. In some cases, this can save you from falling into the trap above, because it allows you to validate your object when you want, before saving it, rather than following the whims of your EJB implementation.

So, what can you do to validate your object? Well, you can check that some fields are not null, or that their value follows a certain pattern. You can write your own custom validator that’s adapted to your specific objects, by checking for invalid combinations of field values. And… that’s about it.

Notably missing from this list is the ability to check anything in the database. The philosophical reason for this is that EJB is completely database-agnostic, and in fact it’s trying to pretend that there is no database and apart from one magical call to a save() method, your objects live forever in some kind of implementation-independent limbo. Of course there’s no universal way to access that limbo, so if you want to do it then you can kiss your platform-independence goodbye.

Validating Uniqueness

There’s not even a generic way to implement something like the @Unique validation, which is so obvious that people keep asking for it. It would simply ensure that a property is unique for that kind of object, so that for example you don’t have two User objects with the same name or emailAddress. But it doesn’t exist in the Bean Validation specification. The official reason is that:

@Unique cannot be tested at the Java level reliably but could generate a database unique constraint generation. @Unique is not part of the BV spec today.

In other words, “life isn’t perfect so we’re not going to bother trying to make it better.” Perhaps they’re trying to save us from our own foolishness (moral hazard), that we might actually believe that it’s enough to check for this and we’ll never fail when writing to the database, the server will never crash or explode in flames, etc…

Incidentally, committing the transaction to check for uniqueness might force your code to go through unreadable contortions to avoid saving an invalid object or inconsistent state in the database (a preference for academic perfection over clean, maintainable code seems to be common among designers of Java standards). And committing a transaction early is also dangerous because the database will no longer detect and warn you about conflicting changes in a concurrent transaction, so you could end up silently overwriting someone else’s changes.

Hibernate is more pragmatic, but @Unique doesn’t even exist there, at least not officially, although there is a sample implementation on the community wiki. I’m not clear exactly why it’s not official, although that page says that “accessing the Session/EntityManager during a valiation is opening yourself up for potential phantom reads”, whatever that means. It is true that:

  • executing many individual reads would be difficult to manage efficiently, if you had many of these annotations;
  • reading while a flush() is in progress may trigger another flush(), leading to an infinite loop;
  • it requires you to jump through hoops in an extremely ugly way just to get a usable Hibernate session object.

Anyway, even if Sun and Hibernate don’t want to write this validator because it’s not technically perfect, many people are going ahead and writing it themselves, even complaining that it’s “harder than you think”.

Our Implementation

So I wanted to talk about what we’ve done to work around this, apart from me swearing never again to use EJB or Hibernate, in RITA. I don’t like the above approach because we wrap an object of our own around the Hibernate session, to keep some of this crazyness locked away in a well-guarded cellar of the application. Their approach gives us no way to access our wrapper object. And it’s not a standard or anything so it doesn’t matter much if we ignore it.

We already have a Hibernate Interceptor, which already does the following:

  • logs object changes in the audit log; (this appears to be the most common use of interceptors in Hibernate)
  • uploads modified records from a local instance to the master, if working online;
  • goes offline if the upload fails;
  • updates version numbers of owned objects on a local instance;
  • updates version numbers of all objects on a master.

We added a checkUniqueConstraints function, which is about 40 lines long (much shorter than the example on the community wiki), that looks for @UniqueInDatabase annotations on properties and runs a quick and dirty Criteria query to verify that no conflicting values are present in the database at that time.

Further Work

It might be a good idea in future to separate these different functionalities into separate layers using something like Listeners. Interceptors are more convenient because they have access to the state of the object when it was loaded, and the current state, which is handy for audit logging.

I think it would be handy if Java (or Hibernate) would provide an easy way to iterate over an object’s properties (whether annotated on their fields or getters or setters) and retrieve a specific annotation, class of annotations, or all annotations. I think this code already exists in Hibernate’s AnnotationConfiguration, and it’s a shame to have to write it again. Our method would be half as long if it could reuse this.

Capturing Prepared Statement Parameters

Wednesday, November 3rd, 2010

I’m using Hibernate for a project, and sometimes I have problems with saving records because the values in the Java object don’t fit within the database columns (e.g. large floating point numbers in a DECIMAL column, or long strings).

Hibernate often executes the INSERT operations in batches, which means that the actual failing values are not visible, because the PreparedStatement API gives you no way to get them out, and Hibernate doesn’t let you intercept them being set. The insert can also happen long after you created the object. These facts makes it very hard to find and fix the invalid data.

I decided to write a wrapper for PreparedStatement to capture the values being set by Hibernate, and a new Batcher to wrap the PreparedStatements returned by the driver in wrapper objects of my class. I was about to start laboriously writing yet another delegator class that does the boring work of implementing 100 methods and delegating each one individually to the wrapped class. I love Java so much.

Luckily I stopped and figured that someone might have done this before, and indeed I found an implementation by Holy. I adapted it slightly and integrated it into RITA.

To replace the default batcher with my own, to enable the wrapping of statements, I just had to add the following line to my Hibernate configuration properties:

hibernate.jdbc.factory_class = org.wfp.rita.db.CapturingBatcher$Factory

Thanks, Jakob Holy!

Consistency, Portability and Backwards Compatibility

Wednesday, October 6th, 2010

Michał Purzyński reported a problem with our pmGraph software, when using a PostgreSQL database:

I’d like to report a bug – pmgraph is unable to get data from postgresql database to which nfacctd is writing…

javax.servlet.ServletException: org.postgresql.util.PSQLException:
ERROR: column "src_port" does not exist

It turns out that pmacct, up to and including version 0.12.4, uses a different column name for the source port if the database is MySQL or SQLite:

if (!strcmp(config.type, "mysql") || !strcmp(config.type, "sqlite3")) {
  strncat(insert_clause, "src_port", SPACELEFT(insert_clause));
  strncat(where[primitive].string, "src_port=%u", SPACELEFT(where[primitive].string));
}
else {
  strncat(insert_clause, "port_src", SPACELEFT(insert_clause));
  strncat(where[primitive].string, "port_src=%u", SPACELEFT(where[primitive].string));
}

I really wish applications wouldn’t change their behaviour arbitrarily like this. To work around it, we would have to hard-code the database types in pmGraph as well, or add an option to switch the column names. Since pmGraph uses JDBC to access the database, it’s not even obvious which driver names are accessing an (underlying) MySQL or SQLite database. So we need to switch the column names, but if we can get pmacct fixed then we can ease the pain for new users in future.

I reported a bug to Paolo Lucente, the lead developer of pmacct, through their mailing list. Paolo agreed to change this behaviour, even though it will break backwards compatibility. We spent some time discussing how to do this in a way that would minimise any impact on existing users.

To do this, we took advantage of pmacct’s existing system of database table versioning, which means that you can still use the oldest table structures, even with the most recent version of pmacct. Paolo agreed to create a new schema version that uses the same column names for all databases, so that pmacct will remain backwards-compatible for all users unless they deliberately choose to change their database schema version.

As we chose to standardise on the PostgreSQL column names, the column names will change for MySQL users between schema versions 7 and 8, so we’ll need to add a configuration option to pmGraph to allow users to choose whether they want the old or the new column names. This is the very same switch that I wanted to avoid in pmacct, but pmGraph has fewer users so it has less impact.

Writing Database Migrations

Tuesday, March 30th, 2010

As part of our work on RITA, we will need to make schema changes (such as creating tables and adding columns) to live production databases during software upgrades without losing data. Here I will show how migrations can be used to implement these changes. Although aimed at Migrate4J users, some of this applies to Rails Migrations as well.

We use Migrate4J to implement database migrations in this Java application. This requires us to write Java code to migrate up to, and down from, each specific database version, by making the required database changes: adding tables and fields, changing field names and types, and modifying data.

However, in our team the database designer is not the person writing these migrations. The designer is working on his copy of the database design, keeping in mind backwards compatibility with the LCTT Access database, and giving me Postgres schema dumps. I have to compare these dumps to identify what has changed, and write the migration code.

What Changed?

First of all, how does one compare dumps? I found Subversion and Diff to be very helpful. We keep the currently-implemented schema checked into Subversion here as a Postgres dump. When I receive a new one, I replace this file, but don’t immediately check it in. I can use the svn diff command, or the Subclipse plugin’s Compare With feature, to see all the changes since the last revision.

Unfortunately Postgres dumps contain some lines that change every time and which aren’t helpful to me, so after I update the dump, I run a command to remove them:

sed -i.orig -e '/^-- TOC entry/d' -e '/^-- Dependencies:/d' master-schema-from-aaron.sql

And then show the differences:

svn diff --diff-cmd=diff -x "-u -F TABLE" master-schema-from-aaron.sql > master-schema-from-aaron.diff

which produces a file that I can load into a syntax highlighting editor (I often pipe it into less instead), and which looks like this:

@@ -554,7 +596,7 @@ -- Name: bundle_type_group; Type: TABLE;
 CREATE TABLE bundle_type_group (
     id integer NOT NULL,
     description character varying(255) NOT NULL,
-    is_qty_allowed smallint,
+    is_qty_allowed smallint NOT NULL,
     record_version bigint NOT NULL,
     is_deleted smallint NOT NULL
 );

This is an extract from a unified diff. The first line, starting with @@, is a header that begins a new section: a block of changed lines, also called a changed hunk or chunk. It includes line numbers from the old and new dump files. It shows three lines of unchanged context above and below the lines that changed.

In this case the line CREATE TABLE bundle_type_group identifies the table being modified, but sometimes the context may not be enough. The last line containing the word TABLE is shown in the header, and normally this helps to identify the table as well.

So this section represents a change to the bundle_type_group table. What changed? A line has been deleted from the dump, and a line has been added. The deleted line is prefixed with - (minus) in the difference file, and the added line is prefixed with + (plus). These lines represent columns in the table.

In this case, the column removed and the column added are both called is_qty_allowed. Because the name is the same, but the types are different, this almost certainly represents a type change to an existing column. If the names were different but the types were the same, it probably represents a renamed column, and if the names and types both differ, it’s probably a deletion of one column and creation of another, discarding the old contents of the column.

It’s worth discussing any unclear changes with the database administrator to be sure exactly what needs to be done. Sometimes there will be data-only migration changes that don’t appear in the schema at all. For example you might decide one day that all people currently called John in the database should now be called Jean, or you might need to add a row to a system table. These can also be done with Migrate4J, but they are not structural (schema) changes.

Creating a New Migration

Assuming that you already have migrations configured in your application, you will have a migration package, where all the classes are named Migration_number. In our case, the migration package is org.wfp.rita.db.migrations. Identify the next migration number in this package, which is usually one higher than the highest number present. Create a class in the package with this name, using this template:

package org.wfp.rita.db.migrations;

/* cleaner sources: */
import static com.eroi.migrate.Execute.*;
import static com.eroi.migrate.Define.*;

public class Migration_2 implements Migration
{
    public void up()
    {
    }

    public void down()
    {
    }
}

Now you can write code to implement the database changes (both schema and data) that you discovered earlier. Each new change is part of an upward migration, and the code that implements it should go into the up method.

It’s important to be able to reverse changes as well. If a schema update fails, you may want to back down to a previous schema, fix the problem that caused it to fail, and try to update again. The code to reverse the change, which is called a downward migration, goes into the down() method.

Note that most migrations lose data in either the forward or the reverse direction (up or down respectively), so you would be well advised to make an automated backup of the database before applying any migrations, in addition to your standard database backup procedures.

Creating Tables

The Execute.createTable() method takes the table name, and an array of Columns. You can create a new Column with one of these constructors:

  • new Column(String columnName, int columnType)
  • new Column(String columnName,
    int columnType,
    int length,
    boolean primaryKey,
    boolean nullable,
    Object defaultValue,
    boolean autoincrement)
columnType
The type of the column, from java.sql.Types, e.g. Types.INTEGER, Types.FLOAT, Types.VARCHAR.
length
The length of CHAR and VARCHAR columns. The length of all other column types, particularly DECIMAL, must be specified in another way, see below.

primaryKey
True if this column should be part of the primary key, or false otherwise (the default). You can have any number of columns in the primary key, and RITA uses composite primary keys extensively.
nullable
True if this column should be allowed to contain NULL values, and false otherwise.
defaultValue
The default value for new rows. If you set this to null, and the column is not nullable, then a value must be supplied for each record inserted.
autoincrement
True if the column should contain automatically-assigned numbers, using the AUTO_INCREMENT attribute in MySQL, or IDENTITY columns or sequences on databases that support them.

To create a new table called persons, with three columns:

ID
an automatically-assigned integer primary key
fish
a float
rope
a string, 40 characters long, not nullable, defaulting to nylon

we could use the following code in the up migration:

Execute.createTable(new Table("persons", new Column[]{
    new Column("id", Types.INTEGER, -1, true, false, null, true),
    new Column("fish", Types.FLOAT),
    new Column("rope", Types.VARCHAR, 40, false, false, "nylon", false)
}));

Unfortunately this syntax doesn’t allow specifying unique keys, indexes, foreign keys, and precision and scale of decimal columns when the table is created. There is another, shorter syntax which allows specifying the precision and scale:

createTable(table("persons",
    column("id", INTEGER, notnull(), primarykey()),
    column("fish", NUMERIC, precision(8), scale(5)),
    column("rope", VARCHAR, length(40), notnull(), defaultValue("nylon")),
    ));

If that still seems like too much work, and you have a database dump of your new schema, have a look at generating from Postgres dumps below.

The reverse, which you would normally put into the down() method, is simply to drop the table.

Dropping Tables

Dropping a table is as simple as:

Execute.dropTable("persons");

Note that all data in the table will be lost. To recreate the empty table structure in the reverse migration, just create it again.

Adding Columns

To add an INTEGER column called hairs to the persons table, you would add the following code to the up() method:

Execute.addColumn(new Column("hairs", Types.INTEGER), "persons");

The addColumn method takes a Column object, which you can create using either of the methods new Column(...) or column(...) described under creating tables above. The column(...) method is shorter, and the only way to specify the scale and precision of decimal (NUMERIC) columns.

If the change is adding a column, the reverse is to remove the column again, which belongs in the down() method:

Execute.dropColumn("hairs", "persons");

Note that your newly added column will contain default values for all records. If you know what the values should be, or can recreate them using a query, you could execute SQL queries to populate it. Also note that if you migrate down past this version, the column will be dropped and all data contained in it will be lost.

Removing Columns

This is the exact opposite of Adding Columns above. Put the dropColumn() in the up migration, and the addColumn() in the down migration.

Note that migrating down past this migration will not restore the data that was in your column before. If you know what it was, or can recreate it using a query, you could reinsert it using SQL queries.

Renaming Columns

Changing the name of a column does not lose any data. For example, we can rename the column called fish to hats in the persons table, and hope that people don’t try to wear their pet haddock:

Execute.renameColumn("fish", "hats", "persons");

The down() migration trivially renames the column from the new name back to the old name.

Indexes

You can add indexes to columns, both to improve search performance, and to enforce the uniqueness of values in certain columns. The addIndex() method takes an Index object, which you can either create by calling its constructor, or more concisely by calling index() or uniqueIndex(). Both take the same parameters:

index(String indexName, String tableName, String... columnNames)

indexName is the name of the index, which can be null to generate a name automatically. However, such indexes cannot reliably be removed, so I recommend always naming your indexes explicitly. tableName is the name of the table that the index will be applied to, and columnNames is a list of names of columns that will be included in the index.

For example, to uniquely index the fish and rope columns in the persons table:

Execute.addIndex(uniqueIndex("uk_fish_rope", "persons", "fish", "rope"));

You can drop an index, for example for downward migration, using the index name and the table name:

Execute.dropIndex("uk_fish_rope", "persons");

Foreign Keys

Foreign keys link one table to another, to enforce referential integrity between tables. You can create them with Execute.addForeignKey(), which takes a ForeignKey object. There are four ways to construct a ForeignKey:

  • ForeignKey(String name, String parentTable, String parentColumn, String childTable, String childColumn)
  • ForeignKey(String name, String parentTable, String parentColumn, String childTable, String childColumn, CascadeRule deleteRule, CascadeRule updateRule)
  • ForeignKey(String name, String parentTable, String[] parentColumns, String childTable, String[] childColumns)
  • ForeignKey(String name, String parentTable, String[] parentColumns, String childTable, String[] childColumns, CascadeRule cascadeDeleteRule, CascadeRule cascadeUpdateRule)

As you can see, these are just the four combinations of whether parentColumns and childColumns are single column names or arrays of column names, and whether the cascade rules are specified or not (they default to “none” if not supplied).

For example, to force a person’s fish_id column to point to the ID of a record in the fish table, you could use this:

Execute.addForeignKey(new ForeignKey("fk_persons_fish", "persons", "fish_id", "fish", "id"));

You can drop a foreign key, for example for downward migration, using the key name and the child (referenced) table name:

Execute.dropIndex("fk_persons_fish", "fish");

Executing Queries

You can execute any arbitrary SQL statement, for example to insert rows into a newly created table or populate a newly created column:

Execute.executeStatement(Configure.getConnection(),
    "INSERT INTO users SET name = 'fred', password = 'flintstone'");
Execute.executeStatement(Configure.getConnection(),
    "UPDATE users SET age = 42 WHERE name = 'barney'");

Although data modification language is much more standard across databases than data definition language, it’s important to be careful only to use ANSI SQL in such statements if cross-database compatibility is important for your application (or might become important in future).

Generating Automatically

If you already have a table structure in a database somewhere, for example if you are retrofitting migrations to an existing project, or if you prefer using GUI tools to design databases, and to reduce the risk of errors, you may want to generate the migration code automatically.

I wrote a script to create Migrate4J migrations automatically from Postgres database dumps. It’s not perfect, it probably only handles the SQL that we actually use, and it’s not well tested, but it may help you. Just run it with the name of the exported schema dump file as its parameter, and it will generate Java code on the standard output, that you can copy and paste into a Java source file.

If the schema will continue to change, and you want help with creating new table definitions in future, you can save the generated output to a file under version control. When you need to generate migration code for a new schema, just overwrite that file, and use svn diff as before to show the differences. They will now be expressed in Java code, which is easier to copy and paste into a new migration.

Applying Manually

In Eclipse, with a migrate4j.properties file on your classpath, you should be able to open the Migrate4J JAR file in Eclipse, expand the com.eroi.migrate package, right-click on Engine and choose “Run As/Java Application”.

Applying Programmatically

As we are using Hibernate, we get a database connection using its Work class, and use it to invoke the migration engine:

// set up Migration schema and run all migrations
m_Session.doWork(new Work()
{
    public void execute(Connection connection) throws SQLException
    {
        Configure.configure(connection, "org.wfp.rita.db.migrations");
        Engine.migrate();
    }
});

Version Control

If I don’t check in the master schema changes immediately, when does it happen? I try to wait until I have all the schema changes implemented in Hibernate annotations and migrations, and run as many tests as I feel the need to run, before checking everything in.

This ensures that the documentation checked in is consistent with the code at that point in time, that I can see the changes to the SQL dump, the Hibernate mappings and the migrations for a single schema update and compare them side-by-side, and reduces the risk of checking in broken code.