Archive for the ‘Microsoft’ Category

Content indexing in Django using Apache Tika

Wednesday, February 1st, 2012

For the Documents module of our new open-source Generic Intranet, we need to be able to extract the text content and metadata from various kinds of documents:

  • PDF files
  • Microsoft Office DOC, XLS and PPT files
  • and the new XML equivalents, DOCX, XLSX and PPTX.

I found various tools online to help extract this text, largely thanks to a couple of Stack Overflow threads. This left us with a hodgepodge of tools.

There were a number of problems with this hodgepodge:

  • I was unable to find any Python or command-line solution for old Excel (XLS) files;
  • These solutions did not extract metadata, only document text;
  • The choice of which tool to use depends on the MIME type returned by the file(1) command, which varies depending on the OS (Debian/Ubuntu or CentOS) and which version of the library is installed.

Another Stack Overflow post recommended Apache Tika for metadata extraction. It appears to support all the document formats that we need, and to have auto-detection of the document format, which solves all the MIME type problems as well. However, it introduces a new problem: it’s written in Java, which is hard to access from Python.
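
To illustrate why detecting the format from the file's content is more robust than trusting a MIME string, here is a minimal Python sketch that classifies a document by its leading "magic" bytes. This is not Tika's detection algorithm, only a simplified illustration; the function names `sniff_family` and `sniff_file` are hypothetical.

```python
# Simplified, hypothetical sketch: classify a document by its leading bytes
# instead of by the MIME string reported by file(1). NOT Tika's algorithm.

PDF_MAGIC = b"%PDF-"
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # container for legacy DOC, XLS, PPT
ZIP_MAGIC = b"PK\x03\x04"                       # DOCX, XLSX, PPTX are ZIP archives


def sniff_family(header: bytes) -> str:
    """Return a coarse format family for the first bytes of a file."""
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_family(handle.read(8))
```

Note that the "zip" and "ole2" answers only identify the container. Telling DOCX from XLSX, or DOC from XLS, requires looking inside the container, which is the kind of work that a full detector such as Tika takes care of.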

Luckily I found some instructions for building a Python wrapper around Tika, using some tools that I’d never heard of, and this seemed like a good approach. Unfortunately the installation process is very non-standard, which would not fit in with our fabric-based automated deployment process, and would make it harder for users to install the Intranet themselves.

The instructions are somewhat outdated at the time of writing, as they refer to Tika version 0.7, while 1.0 has been released. I was unable to register for an account to update that page, so I wrote to the author with the details that I discovered, and will also document here that the following command works for me:

python ../jcc/jcc/__main__.py \
        --include /usr/share/java/org.eclipse.osgi.jar \
        --jar tika-parsers-1.0.jar \
        --jar tika-core-1.0.jar \
        java.io.File java.io.FileInputStream \
        java.io.StringBufferInputStream \
        --package org.xml.sax \
        --include tika-app-1.0.jar \
        --python tika --version 1.0 --reserved asm

I was able to go further than this, and package Tika in a way that makes it easy to install with pip, and thus integrate with our deployment process.

The wrapper is written using JCC, which works by generating and compiling C++ code that links to the Java classes, and then a Python wrapper around that C++. This means that it needs to be recompiled for each platform, so I couldn’t just distribute a binary blob with the Intranet (I had the same problem with DocToText above).

The version of setuptools on our servers doesn’t support JCC’s shared library mode. JCC dies with an error unless shared mode is explicitly disabled or the setuptools patch is applied. I couldn’t do either of these as part of our standard deployment process, so I patched JCC to disable shared mode, since we don’t need it anyway. I also added some patches to allow various setup.py commands used by pip to be forwarded through JCC to the setup function call.

This seems to be enough to allow you to install JCC like this:

pip install git+git://github.com/aptivate/jcc.git

I also wrote a setup.py file that handles pip’s command line invocations and passes the necessary options to JCC, and JCC’s invocation of the setup function. This seems to be enough to install the package using pip:

pip install git+git://github.com/aptivate/python-tika.git

and you can use the last parameter as a package specification in pip_packages.txt, or whatever you pass to pip -r.

You can find the pip-installable Tika package, complete with Tika 1.0 JAR files, in our python-tika repository on GitHub. This will save you the work of downloading and compiling Tika and all of its dependencies. I have started a discussion with the JCC developers about merging these changes into the upstream project.

Move over Microsoft: Design’s going open source

Thursday, May 19th, 2011

I’ve been designing websites since 1999 but switching from Microsoft Windows to Ubuntu has been one of those pivotal experiences worth sharing. Joining Aptivate as the in-house designer recently has given me the opportunity to challenge some pretty old work-flows and move towards a totally open source design practice.

Aptivate work almost exclusively with open source software, so it seemed a great idea to give Microsoft the push, and frankly I’d had enough of waiting 9 minutes for my laptop to reboot. That’s enough time to make 6 people a cup of tea, water the plants and rearrange the desk; all good things ONCE a day, not 5 times when the PC crashes. 3 years of filling up with junk makes a Windows PC very, very sluggish, and makes for an unhappy designer with a tidy workspace.

Changing over – a gradual process

Leaving Microsoft was never going to be a straight switch. Leaving a platform that you depend on for your income is a scary thing.

Since I cut my design teeth on Apple Macs and Adobe software in 1995, I have moved gradually to Windows in a bid to be one step closer to the end users, who on the whole use this platform. I was still, however, heavily dependent on Adobe products, primarily Fireworks and Dreamweaver for interface design and development, but also Illustrator for vector graphics and Photoshop for photo editing. I know they are good, but really? Couldn’t I achieve a good web result without them?

Open source alternatives within Windows

Going back a bit, to before the anger towards the laptop really kicked in, finding open source alternatives was an exciting challenge. These were readily available for Windows, so I didn’t have to switch, just try them out. I found a really useful website, www.osalt.com, which helps you to find open source alternatives to commercial software.

Web development – Aptana 2

Essentially I wanted an HTML/CSS editor with code completion, FTP client and project manager. I tried Amaya, Bluefish, KompoZer and Mozilla SeaMonkey, which all had great features, but none of them did ALL the things I wanted together, and I really was a bit spoilt by having them all rolled into one with Dreamweaver. Finally I found Aptana 2 which, whilst a bit fiddly to get started with, seemed to have everything I needed, hurray!

  • Best thing about Aptana – has to be the available plugin support for different types of project such as PHP, Python/Django, JavaScript, SVN and Git. It’s very comprehensive.
  • What I miss the most from Dreamweaver – nothing… had I still been dependent on the WYSIWYG editor in Dreamweaver then I could argue that this would be a big deal, but since most of my work is now with dynamic database-driven sites, I tend to use Firebug to make visual tweaks before committing them to code.

Web design – Inkscape

A key part of my work involves developing brand assets, icons and other vector graphics for both web and print design. For this I had depended on Illustrator. Quite quickly, however, I discovered Inkscape. What an amazing product! It’s a bit clunky, but it lets me do 90% of the key design tasks I did in Illustrator and 60% of the tasks I used to do in Fireworks. Essentially it is the easiest transition I have made on this journey, and means that I now only use Inkscape for designing for the web.

  • Best thing about Inkscape – The ability to export 32-bit PNG files from any selection, the page or an object. The other best bit is all the native SVG features I have yet to discover!
  • What I miss the most from Illustrator – nothing worth mentioning.
  • What I miss the most from Fireworks – Image optimisation – no image editing software that I’m aware of does a better job of compressing and optimising all manner of image files than Fireworks. It creates an 8-bit alpha PNG with a tiny size and smooth edging where transparent bits kick in. This for me is the single most important missing feature of any open source alternative. Inkscape needs a good image optimising tool.

Photo editing – Gimp

Scaling, cropping, optimising photos; that’s a big part of creating photographic content for a website. Mainly I leave that to the end user but sometimes I create photo-based graphics too.
Gimp seems the most logical choice, as its name suggests, but I’m not finding it intuitive and it seems to crash more often than not. Fireworks has a limited range of bitmap editing tools compared to Photoshop, but integrates really well within the context of creating an interface design, as bitmap and vector graphic objects live happily on the same layer or multiple layers. Now there are several open source alternatives for this work, but I’ll admit that I’m not finding a great solution, and again it’s the optimisation issue that makes me frustrated.

Designing for Low Bandwidth

If we need to create websites with small file sizes for countries with low bandwidth then we need powerful optimisers. I found that Fireworks was great because of the high compression rates achievable which result in files more than half the size achievable using other programmes, open source or commercial. So here is a proposal for a future project for Aptivate – create an 8 bit Alpha transparency image optimiser that challenges Fireworks.
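
To give a rough feel for why an 8-bit alpha PNG can be so much smaller than a 32-bit one, here is a back-of-the-envelope Python sketch comparing the uncompressed pixel data of the two layouts. It assumes a hypothetical 200×100 image and a full 256-entry palette, and it ignores the deflate compression that every PNG applies afterwards, so real file sizes will differ. The function names are illustrative only.

```python
def raw_rgba_bytes(width: int, height: int) -> int:
    """Uncompressed size of 32-bit RGBA pixels: 4 bytes per pixel."""
    return width * height * 4


def raw_indexed_bytes(width: int, height: int, palette_entries: int = 256) -> int:
    """Uncompressed size of 8-bit indexed pixels plus palette and alpha table."""
    pixels = width * height          # one palette index (1 byte) per pixel
    palette = palette_entries * 3    # RGB triplet per palette entry
    alpha = palette_entries          # one alpha byte per palette entry
    return pixels + palette + alpha


# Hypothetical 200x100 image, before any compression is applied.
print(raw_rgba_bytes(200, 100))      # 80000
print(raw_indexed_bytes(200, 100))   # 21024
```

The hard part, and what an optimiser has to do well, is choosing the 256 colour-and-alpha combinations so that the edges of transparent areas still look smooth.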

Switching to Ubuntu

A couple of months later I reformatted and partitioned the hard drive of my Dell Precision 9300 workstation. I still wanted access to Windows, so did a dual boot with an extra partition for shared files. Apart from my video card packing in unexpectedly (nothing to do with the installation, I’m assured), the installation went without a hitch. I was amazed by how intuitive the Ubuntu interface was. Using the software install centre, I was able to quickly and directly install all the applications I needed. Start-up and shutdown took mere seconds, and I was in production mode within a couple of hours. I was more than happy to find that Dropbox, Skype and Acrobat Reader were also available. All in all it was pain free, and… Ubuntu is actually graphically beautiful (not something I had anticipated at all).

The dreaded Terminal

Well, it’s inevitable, even for a designer. The WYSIWYG addict’s worst fear, THE TERMINAL!!! Agghh!! A baptism of code fire at Aptivate, no namby-pamby intro here. I had pulled down the designer’s defence gate and in trickled (and sometimes poured) an endless stream of code, configuration files, settings, database fixtures, tables, smart tricks and speedy scripts, screen sharing and networking wizardry. Ah! Then I had a cup of tea. I’ll be a terminal ninja one day.

Do I miss Windows?

I don’t miss Windows. I booted into it shortly after installing Ubuntu, and found it an empty experience, like going back to a house you’ve just moved out of but still have keys to. Can’t find the kettle to make a cup of tea, so not staying. I recently installed VirtualBox and now run several different versions of Windows XP to help me debug CSS across browsers.

Moving forward

Apart from using my new open source toolbox to help Aptivate refresh its current website, there are things I want to do. It would be great to contribute to the open source community and generally help to improve already great products such as Inkscape. I’ve already started submitting Inkscape icons to the Open Clip Art Library, and it would be great if we developed the image optimiser I mentioned before. There is also a lot of potential in interactive SVGs, which would be great to explore, especially since Internet Explorer 9 supports them… oh Microsoft, you are never forgotten…

If you are a designer reading this, and fancy getting involved with Aptivate’s open source efforts somehow, get in touch. If you are a developer and have a secret optimiser up your sleeve, please let us feed you and keep you amused because we want it!!!

Simple Cisco VPN How-To

Tuesday, August 3rd, 2010

One of our fellow Humanitarian Centre organisations, Engineers Without Borders UK (EWB), asked for our help in setting up a virtual private network (VPN), so that their remote workers can access their file server.

This is something that ought to be really simple. It’s probably the most common use case of VPNs, Windows has a built-in VPN client, and Cisco routers can be used as VPN servers. EWB want it to be simple, because they have non-technical remote workers. It turned out to be much harder and take much longer than I expected.

Information Overload

One of the biggest problems was the lack of useful information, and the profusion of useless information. What I found fell mainly into four categories:

  • Cisco marketing materials touting the benefits of VPNs and their expensive Concentrator and WebVPN products;
  • Cisco knowledge base articles describing the setup of complex VPN scenarios;
  • Cisco command references with little or no details on what each command actually does, or how to use them together;
  • Cisco exam study sites with inaccurate, out-of-date or cookie-cutter command sequences, with even less explanation of what the commands actually do.

Because I couldn’t find what I was looking for, and had to work it out the hard way, I’ve written it up in the hope that it will help others.

I would recommend any organisations that simply want to share files to seriously consider a file-sharing service like Dropbox or raw Amazon S3 instead of a local file server and VPN. In many cases the low upload bandwidth of ADSL connections, combined with internal office use of the connection, will make a VPN impractically slow, especially compared to Amazon’s unlimited upload and download bandwidth. But EWB already had the file server and they just wanted to access it remotely, not to change how they work.

Our scenario is simple: an internal office network with private IP addresses, a Cisco 1800 router providing ADSL connectivity for the office, and remote field workers running Windows desktops.

Getting the Client

For simplicity, we and EWB had hoped to use the built-in VPN client on Windows, which would remove the need to download and install software on the remote workers’ machines. But unfortunately the Cisco 1800 does not support this. Windows uses L2TP over IPSEC for modern, secure VPNs, as a replacement for the old insecure PPTP protocol. But Cisco has crippled the L2TP support in this router, and it only supports raw IPSEC. Only their more expensive routers support serving L2TP over IPSEC, allowing simple direct connections from Windows.

Raw IPSEC is the only remaining option on this router, but it’s difficult to configure due to its complexity, and the number of choices that need to be made. The standard requires both sides to have the same settings configured, but provides no way to do this automatically. Manual configuration would make life very hard for the remote workers. To solve this problem, Cisco has a non-standard protocol for auto-configuration of the clients:

Establishing a VPN connection between two routers can be complicated, and it typically requires tedious coordination between network administrators to configure the two routers’ VPN parameters.

The Cisco Easy VPN Client feature eliminates much of this tedious work by implementing Cisco’s Unity Client protocol, which allows most VPN parameters to be defined at [the] IPSec server.

Cisco Easy VPN Client for the Cisco 1700 Series Routers

So we needed to find a replacement client that was easy to use and could talk to the Cisco. Preferably a free one.

Then we discovered that although Cisco’s own VPN client is technically free, you can’t actually download it without a support contract, which neither we nor EWB have.

In the end we found that if you go to Cisco’s VPN client software page, find the filename of the latest version of the client, and Google it, you’ll find that several people have had enough of this nonsense and posted the client online, so it can be downloaded.

Of course it’s important to be aware of the potential for viruses in copies that you download from random sites on the Internet, as well as fake download sites that lead you around in circles of free registrations, credit card details and pop-up porn adverts. This site worked fine for me, but it may have been taken down by Cisco’s attack dogs by the time you read this.

Security with Obscurity

We decided to choose a configuration that trades some security for ease of use. So instead of authenticating with certificates, we used pre-shared keys. The file server has its own login system anyway, which provides an additional layer of security once the remote user is connected to the VPN.

Names and Addresses

Connecting clients need to be allocated an IP address to use over the VPN. We could have used public IPs, or private IPs in the same subnet (with proxy ARP), but we chose to use private IPs in a different subnet. This makes the routing easier, as clients and local network servers will know that they have to route the traffic via the router anyway, and it allows EWB to implement stricter network access policies for VPN clients, if they wish.

We needed to create a local pool (not a DHCP pool) to draw these addresses from:

ip local pool vpnpool 192.168.2.100 192.168.2.200
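
As a quick sanity check on the pool above, the following Python sketch (using only the standard `ipaddress` module) counts the addresses in the range and confirms that none of them fall inside the office LAN. The 192.168.1.0/24 office subnet is taken from later in this article; the helper name `pool_addresses` is hypothetical.

```python
import ipaddress

OFFICE_LAN = ipaddress.ip_network("192.168.1.0/24")
POOL_FIRST = ipaddress.ip_address("192.168.2.100")
POOL_LAST = ipaddress.ip_address("192.168.2.200")


def pool_addresses(first, last):
    """Return every address from first to last, inclusive."""
    return [ipaddress.ip_address(n) for n in range(int(first), int(last) + 1)]


pool = pool_addresses(POOL_FIRST, POOL_LAST)
overlap = [addr for addr in pool if addr in OFFICE_LAN]

print(len(pool), "addresses in pool")         # 101 addresses in pool
print(len(overlap), "overlap the office LAN")  # 0 overlap the office LAN
```

The pool size is an upper bound on the number of clients that can hold an address at the same time.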

Keys to the Kingdom

We created an ISAKMP (IKE) policy to specify the authentication method and the level of encryption to be used for negotiation of IPSEC Security Associations (SAs). We chose to make this the first, highest priority policy, and to use AES-256 encryption (strong and fast), Group 2 (1024-bit) Diffie-Hellman key exchange, and pre-shared keys for client authentication as noted above:

crypto isakmp policy 1
 encr aes 256
 authentication pre-share
 group 2
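
For readers unfamiliar with Diffie-Hellman, here is a toy Python sketch of the exchange that the group setting refers to. It uses a tiny prime purely for illustration, whereas Group 2 uses a standardised 1024-bit prime, and the private values would be large random numbers rather than the fixed ones assumed here.

```python
# Toy Diffie-Hellman exchange. The tiny prime is for illustration only and
# offers no security; Group 2 uses a standardised 1024-bit prime instead.
P = 23  # public prime modulus (toy value)
G = 5   # public generator (toy value)


def public_value(private: int) -> int:
    """Value that one side sends over the untrusted network."""
    return pow(G, private, P)


def shared_secret(their_public: int, my_private: int) -> int:
    """Secret derived locally from the other side's public value."""
    return pow(their_public, my_private, P)


# Each side keeps its private value to itself (fixed here for repeatability).
client_private, router_private = 4, 3
client_public = public_value(client_private)
router_public = public_value(router_private)

client_secret = shared_secret(router_public, client_private)
router_secret = shared_secret(client_public, router_private)
print(client_secret == router_secret)  # True
```

Both sides end up with the same secret even though only the public values crossed the network. The pre-shared key is then what proves that the other side is who it claims to be.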

Then we specified the pre-shared key itself. This is the only thing that stops random clients on the Internet from connecting to your local network, so it’s even more important than a strong wireless network key. Of course this is not the real key:

crypto isakmp key ThisKeyMustBeKeptSecret address 0.0.0.0 0.0.0.0

We specify that any IP address can use it by using the wildcard address, 0.0.0.0 0.0.0.0.
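
Since the strength of this key matters so much, it is worth generating it rather than inventing it. A minimal Python sketch, assuming that a 32-character key made of letters, digits, `-` and `_` is acceptable to your router (check its limits on key length and characters); the function name `generate_psk` is hypothetical.

```python
import secrets


def generate_psk(num_bytes: int = 24) -> str:
    """Return a random URL-safe key; 24 random bytes encode to 32 characters."""
    return secrets.token_urlsafe(num_bytes)


psk = generate_psk()
print(len(psk))  # 32
```

The `secrets` module draws from the operating system's cryptographic random source, which makes it a better choice for keys than the `random` module.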

At the End of the Tunnel

It seems to be common in corporate environments that, when a user is connected to a VPN, all of their Internet traffic is routed through the VPN. It certainly makes it easier for the network administrators, as they don’t have to define specific routes for the tunnel, but it wastes their bandwidth and makes Internet access much slower for the remote workers, so we decided not to do this.

Just routing a single subnet through a tunnel is called a split tunnel. I couldn’t find simple documentation on setting it up, so I used the Cisco Easy VPN Remote example, extracting just the bits we needed to route only the 192.168.1.0/24 subnet through the tunnel.
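
The effect of a split tunnel on the client can be sketched in a few lines of Python: only destinations inside the listed subnets go through the VPN, and everything else uses the client's normal Internet connection. This is a conceptual illustration, not how the Cisco client is implemented; the names `next_hop`, `vpn-tunnel` and `local-gateway` are made up for the example.

```python
import ipaddress

# Subnets that the server tells the client to reach through the tunnel.
SPLIT_TUNNEL_NETWORKS = [ipaddress.ip_network("192.168.1.0/24")]


def next_hop(destination: str) -> str:
    """Decide where a packet for `destination` is sent under a split tunnel."""
    addr = ipaddress.ip_address(destination)
    if any(addr in network for network in SPLIT_TUNNEL_NETWORKS):
        return "vpn-tunnel"
    return "local-gateway"


print(next_hop("192.168.1.2"))   # vpn-tunnel
print(next_hop("203.0.113.10"))  # local-gateway
```

Without a split tunnel, the second case would also return the tunnel, which is what wastes the office's bandwidth.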

First we have to create an access control list (ACL) that defines, on the local (source address) side, what traffic clients should route into the tunnel:

ip access-list extended ewb_office_split_tunnel
 remark Defines which local (office) networks a remote VPN client will route to
 permit ip 192.168.1.0 0.0.0.255 192.168.2.0 0.0.0.255

I’m not sure if the second half of the ACL is actually necessary. It doesn’t appear to make any difference if I specify any instead of 192.168.2.0 0.0.0.255.
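
Note that the ACL uses wildcard masks (0.0.0.255), which are the bitwise inverse of the more familiar netmasks (255.255.255.0). A small Python sketch of the conversion, assuming a contiguous wildcard mask; the helper name `wildcard_to_network` is hypothetical.

```python
import ipaddress


def wildcard_to_network(address: str, wildcard: str) -> ipaddress.IPv4Network:
    """Convert an address and contiguous wildcard mask to a prefix network."""
    bits = int(ipaddress.IPv4Address(wildcard))
    # A contiguous wildcard looks like 0...01...1 in binary, so adding one
    # must carry through every set bit and leave none in common.
    if bits & (bits + 1):
        raise ValueError(f"non-contiguous wildcard mask: {wildcard}")
    prefix_length = 32 - bits.bit_length()
    return ipaddress.IPv4Network(f"{address}/{prefix_length}")


print(wildcard_to_network("192.168.1.0", "0.0.0.255"))  # 192.168.1.0/24
print(wildcard_to_network("192.168.2.0", "0.0.0.255"))  # 192.168.2.0/24
```

So the permit line above reads as "traffic between 192.168.1.0/24 (the office) and 192.168.2.0/24 (the VPN pool)".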

Client Configuration

We use Cisco’s EzVPN (Unity) protocol, as described earlier, to configure connecting clients automatically. To do this, we have to tell the server what configuration should be sent to clients when they connect:

crypto isakmp client configuration group EWB
 key ThisKeyMustBeKeptSecret
 dns 192.168.1.1
 wins 192.168.1.2
 pool vpnpool
 acl ewb_office_split_tunnel
 netmask 255.255.255.0

A little explanation about what these options do:

crypto isakmp client configuration group [name]
The name must match the group name that the client uses when it connects. This is how the server decides which configuration to send to the client.
key
For some reason the client needs to be told what key to use, even though it’s already been entered by the user, and the client knows it because it wouldn’t be able to get this far in the negotiation without it!
dns
Tells the client which DNS server to use, for resolving local (private) hostnames, or resolving inside the split horizon. You can specify a second DNS server after the primary one. You probably only need this if you’re running a Windows domain, in which case it should point to the domain controller, or if you have split horizon DNS.
wins
Tells the client which WINS server to use, for resolving local SMB server names. Again, you probably only need this if you’re running a Windows domain, in which case it should also point to the domain controller.
pool
Tells the server which local pool (not DHCP pool) to assign the client’s address from. You can specify any name here, even a pool that doesn’t exist, but clients won’t be able to connect unless the pool name is a valid local pool.
acl
This ACL, which we defined earlier, is used to tell the clients which subnets are reachable through the connection (split tunnel mode). If no acl statement is used, the tunnel is not split, and a default route is set through the VPN tunnel instead.
netmask
Defines the network mask that the client will apply to its client interface, in combination with the IP address assigned from the pool.

Profiling

Next, we create an ISAKMP profile on the server which tells the server to assign IP addresses automatically, and which virtual template to use when creating the virtual-access interfaces for the server side of the tunnel. We haven’t defined the virtual template yet, but we will in a second.

crypto isakmp profile ewb_isakmp_profile
   match identity group EWB
   isakmp authorization list sdm_vpn_group_ml_4
   client configuration address respond
   virtual-template 1

When a client connects using the group name EWB, it will check for network authorization using the AAA list name sdm_vpn_group_ml_4 (or default if that list doesn’t exist), respond to IP address requests from the client (using the pool defined in the client configuration above), and create a local virtual-access interface based on virtual template number 1.

You should use the same group name that you used for the client configuration above, instead of EWB, unless you’re EWB of course.

Strong Encryption

Now we define the level of encryption used for data communications with hosts on the internal network, as opposed to securing the negotiation process. We start by defining a transform set which uses 256-bit AES encryption, the SHA hash algorithm and LZS compression for data packets:

crypto ipsec transform-set ewb_encryption esp-aes 256 esp-sha-hmac comp-lzs

Then we create an IPsec profile that links these settings to the ISAKMP profile that we defined above:

crypto ipsec profile ewb_ipsec_profile
 set transform-set ewb_encryption
 set isakmp-profile ewb_isakmp_profile

Virtual Template

Now we define the template for the virtual interfaces that we referenced above in the ISAKMP profile:

interface Virtual-Template1 type tunnel
 ip unnumbered Vlan1
 zone-member security in-zone
 tunnel mode ipsec ipv4
 tunnel protection ipsec profile ewb_ipsec_profile

We use ip unnumbered Vlan1 to set the IP address of the virtual-access interfaces to the address of the router on the local LAN (in this case it’s a VLAN bridge), which allows you to ping the router using its internal IP address (192.168.1.1 in our case) when you’re connected to the VPN, which is a useful connectivity test.

We place the virtual interfaces into the in-zone (internal zone) which means that they have full access to the local network, which is not very secure, but simplifies things. We also specify that this interface accepts only traffic encrypted with IPsec and bound to the profile that we created earlier. I’m not sure why it needs to be bound in both directions, as the IPsec profile is connected to the ISAKMP profile which is connected to this virtual interface already.

Client Setup

That should be it for the server-side setup. To configure a client, install the VPN software you downloaded earlier, start it, create a new IPsec configuration, and enter the following details:

Server
The public IP address of the VPN server
Group Name
The same group name that you used on the server earlier
Pre-Shared Key
The same key that you entered on the server earlier

Now click on the Connect button, and after a few seconds the window should minimize to the system tray, and you should be connected to the VPN. You can check this by pinging the internal IP address of the router (e.g. 192.168.1.1) and if that works, the IP addresses of whatever internal servers you want to connect to.
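
If ping is blocked by a firewall somewhere along the way, a TCP connection to a service port is a useful alternative test. Here is a minimal Python sketch; the function name `can_connect` is hypothetical, and which host and port to test depends on your own servers.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, assuming the file server is at 192.168.1.2 and serves Windows file sharing on TCP port 445, `can_connect("192.168.1.2", 445)` should return True once the VPN is up.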

If it doesn’t work, use the Log menu to enable logging, try to connect again, and check the results on the Logging tab. You can also try enabling IPsec debugging on the router, in privileged EXEC mode (not configuration mode):

debug crypto engine packet
debug crypto ipsec error
debug crypto isakmp error
debug crypto verbose
terminal monitor

When the configuration works, write it to the router’s non-volatile memory to ensure that you don’t lose it when you next reboot the router:

write

And that’s it!
