View low bandwidth version

Archive for March, 2010

Writing Database Migrations

Tuesday, March 30th, 2010

As part of our work on RITA, we will need to make schema changes (such as creating tables and adding columns) to live production databases during software upgrades without losing data. Here I will show how migrations can be used to implement these changes. Although aimed at Migrate4J users, some of this applies to Rails Migrations as well.

We use Migrate4J to implement database migrations in this Java application. This requires us to write Java code to migrate up to, and down from, each specific database version, by making the required database changes: adding tables and fields, changing field names and types, and modifying data.

However, in our team the database designer is not the person writing these migrations. The designer is working on his copy of the database design, keeping in mind backwards compatibility with the LCTT Access database, and giving me Postgres schema dumps. I have to compare these dumps to identify what has changed, and write the migration code.

What Changed?

First of all, how does one compare dumps? I found Subversion and Diff to be very helpful. We keep the currently-implemented schema checked into Subversion here as a Postgres dump. When I receive a new one, I replace this file, but don’t immediately check it in. I can use the svn diff command, or the Subclipse plugin’s Compare With feature, to see all the changes since the last revision.

Unfortunately Postgres dumps contain some lines that change every time and which aren’t helpful to me, so after I update the dump, I run a command to remove them:

sed -i.orig -e '/^-- TOC entry/d' -e '/^-- Dependencies:/d' master-schema-from-aaron.sql

And then show the differences:

svn diff --diff-cmd=diff -x "-u -F TABLE" master-schema-from-aaron.sql > master-schema-from-aaron.diff

which produces a file that I can load into a syntax highlighting editor (I often pipe it into less instead), and which looks like this:

@@ -554,7 +596,7 @@ -- Name: bundle_type_group; Type: TABLE;
 CREATE TABLE bundle_type_group (
     id integer NOT NULL,
     description character varying(255) NOT NULL,
-    is_qty_allowed smallint,
+    is_qty_allowed smallint NOT NULL,
     record_version bigint NOT NULL,
     is_deleted smallint NOT NULL
 );

This is an extract from a unified diff. The first line, starting with @@, is a header that begins a new section: a block of changed lines, also called a changed hunk or chunk. It includes line numbers from the old and new dump files. It shows three lines of unchanged context above and below the lines that changed.

In this case the line CREATE TABLE bundle_type_group identifies the table being modified, but sometimes the context may not be enough. The last line containing the word TABLE is shown in the header, and normally this helps to identify the table as well.

So this section represents a change to the bundle_type_group table. What changed? A line has been deleted from the dump, and a line has been added. The deleted line is prefixed with - (minus) in the difference file, and the added line is prefixed with + (plus). These lines represent columns in the table.

In this case, the column removed and the column added are both called is_qty_allowed. Because the name is the same, but the types are different, this almost certainly represents a type change to an existing column. If the names were different but the types were the same, it probably represents a renamed column, and if the names and types both differ, it’s probably a deletion of one column and creation of another, discarding the old contents of the column.

It’s worth discussing any unclear changes with the database administrator to be sure exactly what needs to be done. Sometimes there will be data-only migration changes that don’t appear in the schema at all. For example you might decide one day that all people currently called John in the database should now be called Jean, or you might need to add a row to a system table. These can also be done with Migrate4J, but they are not structural (schema) changes.

Creating a New Migration

Assuming that you already have migrations configured in your application, you will have a migration package, where all the classes are named Migration_number. In our case, the migration package is org.wfp.rita.db.migrations. Identify the next migration number in this package, which is usually one higher than the highest number present. Create a class in the package with this name, using this template:

package org.wfp.rita.db.migrations;

/* cleaner sources: */
import static com.eroi.migrate.Execute.*;
import static com.eroi.migrate.Define.*;

public class Migration_2 implements Migration
{
    public void up()
    {
    }

    public void down()
    {
    }
}

Now you can write code to implement the database changes (both schema and data) that you discovered earlier. Each new change is part of an upward migration, and the code that implements it should go into the up method.

It’s important to be able to reverse changes as well. If a schema update fails, you may want to back down to a previous schema, fix the problem that caused it to fail, and try to update again. The code to reverse the change, which is called a downward migration, goes into the down() method.

Note that most migrations lose data in either the forward or the reverse direction (up or down respectively), so you would be well advised to make an automated backup of the database before applying any migrations, in addition to your standard database backup procedures.

Creating Tables

The Execute.createTable() method takes the table name, and an array of Columns. You can create a new Column with one of these constructors:

  • new Column(String columnName, int columnType)
  • new Column(String columnName,
    int columnType,
    int length,
    boolean primaryKey,
    boolean nullable,
    Object defaultValue,
    boolean autoincrement)
columnType
The type of the column, from java.sql.Types, e.g. Types.INTEGER, Types.FLOAT, Types.VARCHAR.
length
The length of CHAR and VARCHAR columns. The length of all other column types, particularly DECIMAL, must be specified in another way, see below.

primaryKey
True if this column should be part of the primary key, or false otherwise (the default). You can have any number of columns in the primary key, and RITA uses composite primary keys extensively.
nullable
True if this column should be allowed to contain NULL values, and false otherwise.
defaultValue
The default value for new rows. If you set this to null, and the column is not nullable, then a value must be supplied for each record inserted.
autoincrement
True if the column should contain automatically-assigned numbers, using the AUTO_INCREMENT attribute in MySQL, or IDENTITY columns or sequences on databases that support them.

To create a new table called persons, with three columns:

ID
an automatically-assigned integer primary key
fish
a float
rope
a string, 40 characters long, not nullable, defaulting to nylon

we could use the following code in the up migration:

Execute.createTable(new Table("persons", new Column[]{
    new Column("id", Types.INTEGER, -1, true, false, null, true),
    new Column("fish", Types.FLOAT),
    new Column("rope", Types.VARCHAR, 40, false, false, "nylon", false)
}));

Unfortunately this syntax doesn’t allow specifying unique keys, indexes, foreign keys, and precision and scale of decimal columns when the table is created. There is another, shorter syntax which allows specifying the precision and scale:

createTable(table("persons",
    column("id", INTEGER, notnull(), primarykey()),
    column("fish", NUMERIC, precision(8), scale(5)),
    column("rope", VARCHAR, length(40), notnull(), defaultValue("nylon")),
    ));

If that still seems like too much work, and you have a database dump of your new schema, have a look at generating from Postgres dumps below.

The reverse, which you would normally put into the down() method, is simply to drop the table.

Dropping Tables

Dropping a table is as simple as:

Execute.dropTable("persons");

Note that all data in the table will be lost. To recreate the empty table structure in the reverse migration, just create it again.

Adding Columns

To add an INTEGER column called hairs to the persons table, you would add the following code to the up() method:

Execute.addColumn(new Column("hairs", Types.INTEGER), "persons");

The addColumn method takes a Column object, which you can create using either of the methods new Column(...) or column(...) described under creating tables above. The column(...) method is shorter, and the only way to specify the scale and precision of decimal (NUMERIC) columns.

If the change is adding a column, the reverse is to remove the column again, which belongs in the down() method:

Execute.dropColumn("hairs", "persons");

Note that your newly added column will contain default values for all records. If you know what the values should be, or can recreate them using a query, you could execute SQL queries to populate it. Also note that if you migrate down past this version, the column will be dropped and all data contained in it will be lost.

Removing Columns

This is the exact opposite of Adding Columns above. Put the dropColumn() in the up migration, and the addColumn() in the down migration.

Note that migrating down past this migration will not restore the data that was in your column before. If you know what it was, or can recreate it using a query, you could reinsert it using SQL queries.

Renaming Columns

Changing the name of a column does not lose any data. For example, we can rename the column called fish to hats in the persons table, and hope that people don’t try to wear their pet haddock:

Execute.renameColumn("fish", "hats", "persons");

The down() migration trivially renames the column from the new name back to the old name.

Indexes

You can add indexes to columns, both to improve search performance, and to enforce the uniqueness of values in certain columns. The addIndex() method takes an Index object, which you can either create by calling its constructor, or more concisely by calling index() or uniqueIndex(). Both take the same parameters:

index(String indexName, String tableName, String... columnNames)

indexName is the name of the index, which can be null to generate a name automatically. However, such indexes cannot reliably be removed, so I recommend always naming your indexes explicitly. tableName is the name of the table that the index will be applied to, and columnNames is a list of names of columns that will be included in the index.

For example, to uniquely index the fish and rope columns in the persons table:

Execute.addIndex(uniqueIndex("uk_fish_rope", "persons", "fish", "rope"));

You can drop an index, for example for downward migration, using the index name and the table name:

Execute.dropIndex("uk_fish_rope", "persons");

Foreign Keys

Foreign keys link one table to another, to enforce referential integrity between tables. You can create them with Execute.addForeignKey(), which takes a ForeignKey object. There are four ways to construct a ForeignKey:

  • ForeignKey(String name, String parentTable, String parentColumn, String childTable, String childColumn)
  • ForeignKey(String name, String parentTable, String parentColumn, String childTable, String childColumn, CascadeRule deleteRule, CascadeRule updateRule)
  • ForeignKey(String name, String parentTable, String[] parentColumns, String childTable, String[] childColumns)
  • ForeignKey(String name, String parentTable, String[] parentColumns, String childTable, String[] childColumns, CascadeRule cascadeDeleteRule, CascadeRule cascadeUpdateRule)

As you can see, these are just the four combinations of whether parentColumns and childColumns are single column names or arrays of column names, and whether the cascade rules are specified or not (they default to “none” if not supplied).

For example, to force a person’s fish_id column to point to the ID of a record in the fish table, you could use this:

Execute.addForeignKey(new ForeignKey("fk_persons_fish", "persons", "fish_id", "fish", "id"));

You can drop a foreign key, for example for downward migration, using the key name and the child (referenced) table name:

Execute.dropIndex("fk_persons_fish", "fish");

Executing Queries

You can execute any arbitrary SQL statement, for example to insert rows into a newly created table or populate a newly created column:

Execute.executeStatement(Configure.getConnection(),
    "INSERT INTO users SET name = 'fred', password = 'flintstone'");
Execute.executeStatement(Configure.getConnection(),
    "UPDATE users SET age = 42 WHERE name = 'barney'");

Although data modification language is much more standard across databases than data definition language, it’s important to be careful only to use ANSI SQL in such statements if cross-database compatibility is important for your application (or might become important in future).

Generating Automatically

If you already have a table structure in a database somewhere, for example if you are retrofitting migrations to an existing project, or if you prefer using GUI tools to design databases, and to reduce the risk of errors, you may want to generate the migration code automatically.

I wrote a script to create Migrate4J migrations automatically from Postgres database dumps. It’s not perfect, it probably only handles the SQL that we actually use, and it’s not well tested, but it may help you. Just run it with the name of the exported schema dump file as its parameter, and it will generate Java code on the standard output, that you can copy and paste into a Java source file.

If the schema will continue to change, and you want help with creating new table definitions in future, you can save the generated output to a file under version control. When you need to generate migration code for a new schema, just overwrite that file, and use svn diff as before to show the differences. They will now be expressed in Java code, which is easier to copy and paste into a new migration.

Applying Manually

In Eclipse, with a migrate4j.properties file on your classpath, you should be able to open the Migrate4J JAR file in Eclipse, expand the com.eroi.migrate package, right-click on Engine and choose “Run As/Java Application”.

Applying Programmatically

As we are using Hibernate, we get a database connection using its Work class, and use it to invoke the migration engine:

// set up Migration schema and run all migrations
m_Session.doWork(new Work()
{
    public void execute(Connection connection) throws SQLException
    {
        Configure.configure(connection, "org.wfp.rita.db.migrations");
        Engine.migrate();
    }
});

Version Control

If I don’t check in the master schema changes immediately, when does it happen? I try to wait until I have all the schema changes implemented in Hibernate annotations and migrations, and run as many tests as I feel the need to run, before checking everything in.

This ensures that the documentation checked in is consistent with the code at that point in time, that I can see the changes to the SQL dump, the Hibernate mappings and the migrations for a single schema update and compare them side-by-side, and reduces the risk of checking in broken code.

R&D: Robots and Development

Tuesday, March 16th, 2010

The Samfya Resource Centre is a telecentre (Internet cafe) run by young women in Samfya, Zambia for other local young women – a component of Camfed‘s “CAMA” network. As part of our ongoing support for the Samfya Resource Centre, we have sent them robots.

drawbot

(c) Fatuma Iseje

Why robots? As a child I grew up programming the Sinclair ZX81. A small computer about the size of a book, with a touch-(in)sensitive membrane keyboard and 1kB of memory. With computers we spend long hours staring at the screen interacting with a virtual world made of intangible pixels. I remember being constantly frustrated that my computer couldn’t do anything in the REAL world. It couldn’t turn lights on and off, it couldn’t move and back then it couldn’t even makes noises. In fact I killed the poor thing when I dropped a soldering iron through the CPU while trying to interface it to my physical world.

It’s all different now. With the advent of devices like the Arduino it’s now easy and cheap to get the computer to reach out into our world. And we can reach into its world. Not only does an Arduino let us physically interact with the computer it can also be automonous. We can make things that interact with us without needing the computer attached to the other end of its USB umbilical cord.

For those that haven’t come across it yet, an Arduino is a simple micro-controller board. OK… what’s so amazing about that, there’s been micro-controller boards around for decades? Well, a few things. Firstly it has a USB port so you can easily connect it to your computer. It has its own simple programming environment using a variant of the C programming language. So it’s not dumbed down at all – no visual programming environment or anything like that. And critically there is a large on-line community eager to share and support people who do not come from a technical background. Its design is open source, it’s cheap (£20) and you can find even cheaper clones.

This exploding field called “Physical Computing” is making computing more fun, more compelling and also sometimes more useful.

Ok robots might be fun, but why send them to Zambia?

The computer used to be like a calculator on steroids. It could “compute” things. If you knew a bit about SIN or COS you could make it draw a circle. Or calculate a volume. Now, although computers are enormously more powerful than the Sinclair ZX81, most of us use them like some mash-up of a TV and a phone. They give us access to information and lets us communicate. These are important and useful functions. But I worry most of us are in danger of treating our computers like new cars – like devices with “no user serviceable parts inside“.  Instead of seeing them as tools that we can fix, modify, improve, re-invent and tailor to our needs.

It’s more than that. Computers aren’t just useful for the stuff they do. I believe that programming computers, tinkering with them, making electronic circuits and building robots builds highly transferable skills. Critical thinking, logical and abstract reasoning and problem solving. In addition, programming an Arduino makes you a dab hand at searching the Internet.

I got into robots at BuildBrightonBrightonRobot – the Brighton hacker-spaces based out of The Skiff – and in particular through Steve Carpenter’s amazing kits.

Eva with the Drawbot - (c) Fatuma Iseje

As an introduction to robotics, we’ve used the “DrawBot” packaged up by Steve with excellent instructions. How complicated and expensive do you think an autonomous drawing robot would be? How about 6 components that require no soldering costing about £3?

Steve has also designed a great robot kit based around the Arduino – the BoxBot. The kit is designed using the Interlocking T-Bolt Construction method, a cost-effective technique for low volume manufacture. With servos and everything you need minus the Arduino, the kit’s about £60. It’s a very flexible kit, you can bolt sensors or motors all over it – but this is not Lego. You need to drill holes in it.

Reading the educational posts about robotics I can see that from a teacher’s point of view an Arduino and the BoxBot may not be ideal. They don’t neatly fit into lesson plans. They don’t just leap into life in a way that keeps a lesson of 30+ students moving along. They take a bit more application and research to use. But unlike Lego, Meccano and other modular systems, they blur the distinction between the kit and the rest of the world. Once you’re empowered to drill holes and wire up your BoxBot you find you’re up for drilling holes and wiring up anything.

Using robots in this context is an experiment for us. We don’t know how it will go. But it’s an experiment the young women of the Samfya Resource Centre are excited about.

We’ll keep you posted on the progress of our latest bit of R&D.

Hi” and “Mulishani” to Bridget, Everlyn, Fatuma, Mary, Mildred and Penelope at the SRC from the Aptivate gang!

SSH Port Forwarding

Wednesday, March 10th, 2010

David Sumbler wrote to the LinuxChix mailing list:

She now has two computers connected via an ADSL router. Both computers run Ubuntu (8.06 and 9.10). I have set things up so that I can log into the router, and also SSH to both computers simultaneously: I use two different port numbers…

I now want to be able to see her desktops, but I haven’t figured out how to do this. Having read the Gnome help, I believe that the Gnome remote desktop is inherently insecure: I would prefer to tunnel things over SSH, probably using vncserver and vncviewer (or perhaps Vinagre).

Can anybody explain what I need to do to get this to work, please?

I get asked this kind of question so often that I thought I’d write it up somewhere so I could just point people to the post.

SSH port forwarding is not hard to do, once you get your head around how it actually works. Thanks to Alan for drawing this simple diagram:

SSH port forwarding is not like a VPN and it’s not magic. It’s quite like a proxy server:

  • You tell SSH, with the -L option, to listen for connections on a port on your local side.
  • SSH connects to the remote host immediately as usual, and then starts listening on this port.
  • When it receives a connection on this port, it tells the other side (the SSH server that you connected to) to connect to the remote hostname and port that you specified.
  • If the remote side succeeds, the two SSH processes join the two sides together, forwarding bytes from each side to the other.

(Note: it’s also possible to ask the remote SSH server to listen on a port on its side, with the -R option, and connect to a host and port on the client side, but in the interests of simplicity I will ignore that for today.)

I’ll show you the commands that I suggested to David, and then explain what they do:

ssh username@ip-address-of-ssh-server -p port1 -L 5901:localhost:5900
ssh username@ip-address-of-ssh-server -p port2 -L 5902:localhost:5900
vncviewer localhost:1 (connects to computer 1)
vncviewer localhost:2 (connects to computer 2)

This opens two SSH connections, one to each of the machines behind his firewall, which are completely independent of each other. One SSH connection would actually be enough, as we will see in a minute, but this way fit more logically with my explanation.

These commands contain some placeholders that must be adapted to your situation:

username
The user name that you want to connect as. You can omit the name and the @ sign if it’s the same as your logged-in user on the client.
ip-address-of-ssh-server
The IP address or hostname of the SSH server that you want to connect to. In David’s case, he can’t see the SSH server directly, so he needs to use the public IP address of the router here, and the router will forward the port to the SSH server on his internal network.
port1 and port2
David said that he can “SSH to both computers simultaneously [using] two different port numbers.” Presumably using port forwarding on his router. These are the two port numbers.
vncviewer localhost:1
This runs the VNC viewer on the client and tells it to connect to VNC display 1, which runs on port 5901 (by definition, VNC ports are display number plus 5900), which we already forwarded to computer 1 using SSH.

After running the two ssh commands command, the first SSH client will be listening on port 5901 on the machine that you run it on, and the second will be listening on port 5902.

After this, until you disconnect the SSH sessions or kill the clients in some way, whenever you connect to port 5901 on the client, it will tell the computer it’s connected to (computer 1) to connect to localhost port 5900 (that is, to its own VNC server) and then join the connections together, forwarding any data sent in either direction over the tunnel.

This part of the SSH command:

-L 5902:localhost:5900

tells the SSH client to Listen on port 5902 on the client, and when it receives a connection, to ask the other side (the server) to connect to (what it sees as) localhost port 5900, and SSH will forward communications between the two over the SSH tunnel.

Note first of all that we tell vncviewer to connect to localhost, not to the IP of the remote computer (internal or external). That’s because the client side of the SSH port forwarding is listening on localhost port 5901, and not any other IP address or port. If you connect to anything other than localhost port 5901, you will not end up talking to the local SSH client connected to computer 1.

Note secondly that when we created the tunnels, we told the ssh client to connect them to port 5900, also on localhost. This time, localhost is relative to the remote machine (the server), so we are telling it to connect to itself (not back to you). We could also specify any IP address and port that is reachable to the server, which is acting as our proxy in this case. However, we cannot specify an IP or port that is reachable to the client but not to the server, because the server will not be able to connect to it.

Now let’s imagine that we want to be able to VNC to both computers over a single SSH tunnel. We can do this by forwarding two different local ports, one to localhost, and one to the IP address of the other computer, like this:

ssh username@ip-address-of-ssh-server -p port1 -L 5901:localhost:5900 -L 5902:192.168.10.5:5900
vncviewer localhost:1 (connects to computer 1)
vncviewer localhost:2 (connects to computer 2)

This assumes that computer 2 has the internal (RFC1918) IP address 192.168.10.5, and allows connections from computer 1 to its port 5900.

Port forwarding is unlike a VPN in several ways. The client does not end up with routing to the ultimate destination, nor does it need it. This means that it works even if the client and server have different views of the IP space, for example if they are located in subnets that use the same IP range to refer to different machines.

The server does not try to connect to the ultimate destination until the client receives an incoming connection (e.g. from vncviewer in this case). At this point, it may discover that there is nothing listening on the port to which it was told to connect, or that the destination host is down, or the port is blocked by a firewall. The server informs the client of this, but the client has no way to pass this information onto the connection that it received, which is has already accepted. All it can do is close the connection.

This means, for example, that if you were to sit at the server and type vncviewer 192.168.10.5, and that computer was not running VNC, you might get a Connection refused error. However, if you sit at the client and type vncviewer localhost, you will see the connection is opened and immediately closed, as though the VNC process was listening but refused to talk to you for some reason. Do not be fooled into assuming that VNC is running on the destination. With SSH port forwarding, you have no idea.

You cannot forward ICMP (pings), UDP sockets (DNS) or any other protocol except TCP using port forwarding, so you will never be able to ping remote hosts using this method alone.

It is currently impossible to add new forwarded ports to an existing connection or to change the ultimate destination host and port, so you must disconnect and reconnect with a new command line instead. This is inconvenient in some cases, especially where you have a long-running process open in the shell. I recommend using ssh -N to open an ssh client that does only port forwarding and not a shell; then open a separate shell if you need one.

The ssh client cannot exit while any connection is open, so if you log out with connections open, it will appear to hang. All open connections will be closed if the ssh client is forcibly killed by a signal or escape character.

If your port forwarding doesn’t appear to be working, check that you don’t have another process listening on the same port. For example, in the VNC case, both Gnome and KDE desktop sharing create a VNC server on the standard port, 5900, so you cannot forward the local port 5900 to anywhere if you have remote desktop access enabled on the client. The easiest solution is to listen on different port numbers, like 5901 and 5902, which correspond to VNC displays 1 and 2 in the command examples above.

Finally, please note that the meaning of commands like these is very different depending on where it is run (on the client or on the server):

vncviewer localhost
vncviewer 192.168.10.5

This is because:

  • The meaning of localhost is different depending on where you run it (on the client or on the server); it always means connecting to the same computer that the command is running on.
  • The meaning of 192.168.10.5 (or any other IP address) similarly depends on where you run it (on the client or on the server); it is always relative to the computers that are reachable from the one running the command.
  • Connections always appear to the recipient to be coming from the computer running the command, so when the client or the server connects to 192.168.10.5, even if that’s the same computer for both, it will see the connections coming from different IP addresses.

Tariq adds that you can also run:

ssh -D 9999 username@ip-address-of-ssh-server

where the -D option tells SSH to creates a SOCKS proxy server tunnel. You can then tell your web browser (and other clients with SOCKS support) to use localhost:9999 as a SOCKS proxy server. This will forward all your browsing through the SSH tunnel, which makes it look like you’re in a different location (e.g. to watch iplayer when not in the UK) and protects your unencrypted web browsing from random sniffers on public networks.