Wednesday 2 February 2011

Binary files

I have downloaded lots of OSM data from the API, XAPI, the Export tab and from other sources such as Geofabrik and Cloudmade. These have usually been XML files, which are often compressed and are easy to understand and deal with, if a bit bulky. Because XML is ubiquitous, there are XML parsers available for the languages I use, with a choice of parsing approaches depending on what you are trying to do. XML parsing remains a fairly slow process, though, and it is rarely an end in itself, so the real use of the data cannot start until the XML has been deciphered.

The volume of data for a given area has steadily increased as detail has increased, and the one thing XML is not good at is being compact. Moving OSM data around means moving it across networks, and this is sometimes a slow process - my own broadband is not very quick and I have no choice of supplier so there is no competitive pressure to improve the performance or the dreadful customer service, but that's another story. One way, potentially, to improve this is to use a binary file format, which could be much smaller than even a compressed XML file and might be quicker to encode and decode. Such a format has been created by Scott Crosby based on Google's Protocol Buffers (protobuf). The files are known as .pbf files.

I have not used the protobuf file formats before, so I read the Google pages, which seemed to make some sense, though the examples are simplistic. I turned to the OSM wiki page describing how Scott has created the protobuf layouts and it left me baffled, so I looked at the source code of various utilities that now incorporate support for .pbf files.

I wanted to write something using a .pbf file and, having done this kind of thing before, I know that it is easy to copy other people's code and use it without really understanding it; understanding the use of the .pbf files was an important objective for me. Google provide direct support for using protobuf in C++, Java and Python, but there were no Python OSM examples to be found, so I decided to start by writing a Python pbf parser so I couldn't just copy someone else's code verbatim.

I downloaded a .pbf file from Geofabrik to work on. Examining the layout of the file was not easy. If you try to use a hex viewer things are not clear, because chunks of the data can be compressed and even the uncompressed parts are difficult to read: the protobuf format squashes data, especially numbers, into the minimum number of bytes to save space. Most of the text used in tags is in a string table, so each string only occurs once in each block of data. In the end I simply wrote code to work through the file, extracting and examining each part step by step. The OSM wiki page did help in some places, but I got most help by looking at other code, the protobuf definitions and the Python files the protobuf compiler creates.
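As a rough illustration of working through the file step by step, here is a minimal Python 3 sketch. It assumes the framing described on the OSM wiki: a 4-byte big-endian length, then a BlobHeader message (field 1 = type, field 3 = datasize), then a Blob message (field 1 = raw bytes, field 3 = zlib-compressed bytes). The function names are invented for illustration, and the hand-rolled field parser only copes with the two wire types needed here. It is not the script described in this post, which relies on the Python files the protobuf compiler creates.

```python
import struct
import zlib


def read_varint(buf, pos):
    """Decode one protobuf varint (7 bits per byte, low bits first).

    Returns (value, position of the next unread byte)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # top bit clear means this is the last byte
            return result, pos
        shift += 7


def parse_fields(buf):
    """Return {field_number: value} for a flat protobuf message.

    Only varint (wire type 0) and length-delimited (wire type 2)
    fields are handled; a repeated field keeps its last value only."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields[number] = value
    return fields


def iter_blocks(stream):
    """Yield (block_type, payload_bytes) for each block in a .pbf stream.

    Assumed layout: 4-byte length, BlobHeader, Blob, repeated to EOF."""
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return  # end of file
        (header_len,) = struct.unpack(">I", prefix)
        header = parse_fields(stream.read(header_len))
        blob = parse_fields(stream.read(header[3]))  # field 3: datasize
        if 1 in blob:        # field 1: uncompressed data
            payload = blob[1]
        elif 3 in blob:      # field 3: zlib-compressed data
            payload = zlib.decompress(blob[3])
        else:
            raise ValueError("unsupported blob compression")
        yield header[1].decode("utf-8"), payload
```

Called as `for kind, payload in iter_blocks(open("some.osm.pbf", "rb"))`, this should give an "OSMHeader" block followed by "OSMData" blocks. Each payload is itself a protobuf message (the string table, nodes, ways and relations live inside it) and still needs decoding.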

I now have a Python script to parse .pbf files, so I can use the data in the same way as I would having parsed it from XML. I have used the Python OSM classes that I created some time ago to store the data so I can write XML as a quick test. If you are interested in seeing the result of my work you can download it from here.

I have tried to parse the downloaded file for England from Geofabrik. The script couldn't load all of the nodes into my 3 GB of memory and was killed. I removed the code that stores the nodes, ways and relations just to let the script run through the whole file. After more than two hours it had run through, but I couldn't have done very much with it as none of the expanded data was saved. It works well for a smaller area, such as a county or a city, but it doesn't handle big files very well at all.
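One possible way to reduce the memory needed, offered only as a sketch and not something my script does, is to keep nodes in typed arrays rather than as one Python object per node. The `NodeStore` class and its methods are hypothetical names; it ignores tags entirely and stores coordinates as integers in units of 1e-7 degrees. I have not tested whether this would be enough to hold all of England in 3 GB.

```python
from array import array


class NodeStore:
    """Hold node ids and coordinates in compact typed arrays.

    Hypothetical helper for illustration; tags are not stored."""

    def __init__(self):
        self.ids = array("q")   # signed 64-bit node ids
        self.lats = array("i")  # latitude * 1e7 (4 bytes on common platforms)
        self.lons = array("i")  # longitude * 1e7

    def add(self, node_id, lat, lon):
        """Append one node, converting degrees to integer units."""
        self.ids.append(node_id)
        self.lats.append(round(lat * 1e7))
        self.lons.append(round(lon * 1e7))

    def get(self, index):
        """Return (id, lat, lon) with coordinates back in degrees."""
        return (self.ids[index],
                self.lats[index] / 1e7,
                self.lons[index] / 1e7)

    def __len__(self):
        return len(self.ids)
```

Each node then costs a fixed handful of bytes across the three arrays, at the price of losing the convenience of a node object with its tags attached.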


vgps said...

It is really hard to understand and parse the new osm pbf file format. Why doesn't OpenStreetMap just use a plain binary format?
nodeid, lat/lon, tags
wayid, (lat/lon,lat/lon,lat/lon), tags

Chris Hill said...

The OSM files are very, very big. Moving them around the internet can be slow and for some people expensive. The .pbf files make the files as small as possible, as well as providing a parseable format. The protobuf formats were designed by Google, who understand the problem well. The big reduction in file size makes the complexity worthwhile. You can always use one of the growing number of tools to convert the .pbf file to an ordinary .osm file once it is on your own network or computer.