Sunday 22 January 2012

Homogenised data? No thanks

I follow the changes made in the East Yorkshire area. Andy Ayre created a Twitter feed for the area which generates tweets as osmeastriding. AndrewArm creates an RSS feed for the area too. Then there is the ITOWorld tool, which helps me look at the editors for a selected area.

Andy Ayre's twitter feed is based on a rectangle that encloses the county, so it also includes York, part of North Yorkshire and part of northern Lincolnshire too. It is very quick so edits appear only a few minutes after they were made and I check it every day. Andy's algorithm looks for changesets that fall within the rectangle, so big changesets that span the county are ignored whether objects in the county are edited or not.
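
I have not seen Andy's code, so the following is only a guess at the kind of test involved: a changeset counts only when its bounding box lies wholly inside the monitoring rectangle. The rectangle and the two changesets are made-up coordinates for illustration, not the real values.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def falls_within(changeset: BBox, area: BBox) -> bool:
    """True only when the changeset's bounding box lies wholly inside the area."""
    return (
        area.min_lon <= changeset.min_lon
        and changeset.max_lon <= area.max_lon
        and area.min_lat <= changeset.min_lat
        and changeset.max_lat <= area.max_lat
    )


# Hypothetical rectangle loosely enclosing the county (not the feed's real one).
AREA = BBox(-1.2, 53.5, 0.2, 54.2)

# A small local changeset and a country-sized one (both invented).
small_edit = BBox(-0.45, 53.72, -0.40, 53.75)
big_edit = BBox(-3.0, 50.0, 1.0, 55.0)

print(falls_within(small_edit, AREA))
print(falls_within(big_edit, AREA))
```

A test like this is cheap because it never looks at the individual objects, which is also why a big changeset is dropped even when it touches something in the county.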

AndrewArm's RSS feed is based on the polygon that describes the county and so it is more focussed. It is up to a day behind the edit. It does, however, show any edits in the county, including changesets that cover bigger areas than the county, so-called big edits.
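
Again as a rough sketch rather than AndrewArm's actual method, which I have not seen: a polygon-based feed can test each edited node against the county boundary, so the size of the changeset no longer matters. The square "county" below is a stand-in for the real boundary, and `point_in_polygon` and `changeset_touches` are hypothetical names.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count boundary crossings to the east of the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def changeset_touches(nodes, polygon):
    """True if any edited node (lon, lat) lies inside the polygon."""
    return any(point_in_polygon(lon, lat, polygon) for lon, lat in nodes)


# Stand-in boundary: a simple square, not the real county polygon.
COUNTY = [(-1.0, 53.5), (0.0, 53.5), (0.0, 54.0), (-1.0, 54.0)]

# A "big edit": one node far away, one node inside the stand-in county.
big_edit_nodes = [(2.0, 51.0), (-0.5, 53.7)]
print(changeset_touches(big_edit_nodes, COUNTY))
```

Checking every node is slower than comparing two rectangles, which may be part of why a feed like this runs behind the edits.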

ITOWorld's system will be familiar to many. It allows you to choose a rectangle to monitor and provides an RSS feed for changes in the selected area. This is a day or two behind the edits and the size of rectangle is limited and is much smaller than the size of East Yorkshire.

Monitoring an area reveals new mappers. Once they have made a few edits I usually send them a message saying 'Hi' and offering info or help. Newbies often want to check something or ask how to do something.

Monitoring also shows up vandalism. Speedy action means vandalism can be reverted without much risk. Reverting a vandal's edit may or may not dissuade him from doing it again, but at least the harm is limited.

Another thing that monitoring shows up is mass edits. Recently there seems to be a spate of them and some are harmful and annoying. When someone replaces a misspelt tag, that is helpful. Changing Name=* to name=* or buildng=yes to building=yes would both benefit the object. However, that is not what many mass edits are about. Some seem to be about homogenising the data, replacing the value part of tags with something generic. Sometimes a tag is removed and a replacement tag is added in its place.

I challenge the people who fire off these homogenising edits, which I dislike. The commonest excuse is that the wiki says the tag should be like this. This is infuriating. The wiki is full of controversy with edits chopping and changing the meaning of a tag back and forth. The open part of OpenStreetMap can be interpreted in many ways, and one way is that the tags that are used are open, that is any tag key and value is allowed. I also hear that TagInfo shows that value x is the most popular so every object must be changed to x. What TagInfo actually shows is that there are many uses of the key and that x is not the only way to use the tag, so why should they all be forced to be the same?

The second reason quoted is a question: how can the data be used if it is not consistent? Ontologies are mentioned. The odd thing is that people who say this are not the people who use the data for rendering, routing, analysis or whatever. The OSM data as derived from the API as XML (or pbf) needs to be processed to be used. That processing often loads the raw data into a database or processes it into a working file, and that loading process can include catching a range of values that are useful for the render, analysis or whatever. If the variations of the key values are useful then they are used, but if some mass edit has flattened out the data into a monotonous grey, then those differences are not available. This processing will be a few lines of code, written once and used over and over without harming the detail at all. This is how landuse=cemetery and amenity=graveyard can be treated as the same thing or treated differently.
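
To show how little code that takes, here is a minimal sketch of the idea, assuming the tags of each object have already been read into a dictionary. The function name `classify`, the label "burial_ground" and the `merge_burial_grounds` switch are my own inventions for illustration, not part of any existing tool.

```python
# Tag pairs that this particular data user chooses to treat as one class.
BURIAL_TAGS = {("landuse", "cemetery"), ("amenity", "graveyard")}


def classify(tags, merge_burial_grounds=True):
    """Return a working class for an object, leaving the OSM tags untouched.

    With merge_burial_grounds=True both tags map to one class;
    with False the original value is kept so they stay distinct.
    """
    for key, value in tags.items():
        if (key, value) in BURIAL_TAGS:
            return "burial_ground" if merge_burial_grounds else value
    return None


print(classify({"landuse": "cemetery"}))
print(classify({"amenity": "graveyard", "name": "St Mary"}))
print(classify({"amenity": "graveyard"}, merge_burial_grounds=False))
```

The choice is made by the data user at load time, so another user can make the opposite choice from the same unflattened data.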

I know that a carefully chosen tag that gets splatted into some other meaning by a thoughtless mass edit is annoying. If that annoyance persuades a mapper to stop contributing to OSM, that is a problem. Our most precious resource is our mappers, and if a mass edit risks losing just one of them then that is very serious indeed.

I have been accused of not wanting objects I created to be changed; of being too protective. I'm not, and that misses the point: OSM is a wiki. The objects are all open to edit every day. I just want those edits to improve the quality, not squash it out.


Tom said...

I sympathise, but do lean slightly towards working to homogenise tags a bit more. It is a frustrating extra barrier that many data users might not be aware of. In the projects where I've used OpenStreetMap I've just assumed I'm losing a certain percentage of data because of this diversity.

Perhaps a more polite way of going about this would be to produce a web based tool similar to the OSM Inspector, pre-loaded with a matrix of possible mistakes, showing all the metadata in your area that might benefit from some attention?

That way local mappers can review and act on these using their own judgement.

A feature request for Frederik?

A second idea would be to develop this matrix on the wiki and make it easy to download in a CSV or similar format, to give data users a head start.

Chris Hill said...

I'm interested in how you use OSM data. Do you load it into a database or do you process the raw .osm or .pbf file directly? What kind of diversity causes you problems? I'm asking because, after a discussion, I agree that there should be better guidelines about mass edits. Understanding what causes people a problem may help to write the guidelines and may produce tools to reduce the problem without reducing the diversity, which someone else might find useful and interesting.

The idea of introducing a lint layer for local mappers to use to act on is interesting, but it is often lint tools that encourage the mass edit authors to make their changes in the first place. I'm not sure how you would stop a local just flattening their local diversity.

The matrix of tags that mean similar things is an interesting idea. That would help data users work with the diversity, removing the need for mass edits that remove the diversity in the first place. It could drive lint tools to not flag these differences and possibly be added to data extraction tools like osm2pgsql.
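
As a rough sketch of how a data user might apply such a matrix if it were published as CSV: the column names and the two sample rows below are assumptions of mine for illustration, not an agreed format or an agreed list of equivalents.

```python
import csv
import io

# Hypothetical matrix; a real one would be downloaded rather than inlined.
SAMPLE = """key,value,canonical_key,canonical_value
oneway,true,oneway,yes
oneway,1,oneway,yes
"""


def load_matrix(text):
    """Map (key, value) pairs to the (key, value) a data user prefers."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        (row["key"], row["value"]): (row["canonical_key"], row["canonical_value"])
        for row in reader
    }


def normalise(tags, matrix):
    """Return a normalised copy of the tags; the input is not modified."""
    out = {}
    for key, value in tags.items():
        new_key, new_value = matrix.get((key, value), (key, value))
        out[new_key] = new_value
    return out


matrix = load_matrix(SAMPLE)
print(normalise({"oneway": "true", "highway": "residential"}, matrix))
```

The normalising happens in the user's own copy, so the values in the OSM database stay exactly as the mapper entered them.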

Tom said...


On using data, here's an example:

It's not very elegant or sophisticated, a reflection of its author's technical skills! I crudely hacked some code I found on the wiki to crunch through OSM XML, without any pre-processing at all. I don't want to try and cover straightforward spelling mistakes, cases of people using "deprecated" tags and people making up their own tagging schema. Peter Reed has done some great work showing how this is particularly noticeable for keys like shop and features like university campuses.

Now I know there are people interested in using OSM geodata who might previously have gone to the OS or the A-Z lot and done clever stuff with the data. For them, some guidance along the lines of an indicative matrix would probably be helpful.

But I got interested in OSM before the Google Maps API came along because I immediately saw that lots of voluntary groups, charities, small businesses and hobbyists could find OSM really useful. Making it usable for them would mean a more useful project and many more enthusiastic contributors. Sadly, OSM is for most of them a very confusing world with walls they won't bother to scale. This diverse tagging is one brick in the wall.

I see your point about misusing the "lint" layer. It's a delicate matter, one of many where we rely on mapping principles to be almost self-evident to people who click on the edit tab or even (shudder) get sucked into the wiki. That opens up an entirely different can of worms! One could at least have some big bold text in the popup information saying "THIS MAY BE VALID" with an easy shortcut to contact the author if unsure?

gom1 said...

This is a really interesting issue. Much as we may want to find a single, simple, general solution, there isn't one.

Which is why I'm much closer to Chris's position.

In my view there are two areas where it makes sense to automatically fix data. One is simple and obvious spelling mistakes, such as the examples that Chris gives. That is relatively uncontentious (I think). The other area where automated fixes are worth discussing is a small set of core features that are used in a few high-profile renders. That's because these highly visible services rely on crunching large volumes of data, so there's a practical limit to how much data cleaning they can maintain and handle.

Outside those cases I think it is better for data users to clean / homogenise stuff outside the OSM database, at the point at which they use it.

In my experience it isn't difficult to explore the keys and values that people have used in the area that I am interested in processing, so that I capture the bulk of the data that is of use to me. This seems to be how it is handled by the more complex renders (cycle maps etc).

Of course this approach means missing a few things that have been contributed, but whenever I have taken this approach it's been obvious that a few alternatives pick up almost everything that has been contributed. The number missed is swamped by the number that have not yet been added to the database at all.

To me, there are two huge problems with trying to homogenise the data itself.

One is that nobody can envisage all the different ways that the data might be used. Homogenising it around one particular view is too restrictive, and imposing one view is arrogant.

The other risk is that we lose a lot of the expressive way that contributors chose to tag nuances in what they see in the real world.

There are loads of examples where different terms describe the same thing at a superficial level, but convey nuances that may be important to some data users.

At the current state of development there is a growing amount of stuff in OSM where it would be better (in my view) to think of tagging as a living and evolving language for describing geographic features - rather than as a set of values in a database. To say "OSM is not a database" might be going too far for some, but I think there is more than a degree of truth in it.

So rather than try to impose rules, it's better to find ways of ensuring that contributors (always with varying levels of expertise) are aware of the different options that they can choose to describe a feature, and at the same time to build some common ground between contributors and data users on how these values should be interpreted.

Andrew said...

I realise that this is a complicated issue without an easy solution.

I joined the project with the hope that I could create something that people can actually use; as such I value having an understandable database and will criticise anyone who doesn’t share this value.

I was willing to speak up against people with an excessive sense of data ownership when I contributed to Wikipedia and I’m willing to do that here; this applies to both sides of the argument.

Arguments about tag misuse also cut both ways; I’ve been tackling a town where the property boundaries of terraced houses had been mapped as areas tagged as barrier=fence.

Is it so bad to remove tags such as oneway=true that no-one uses in new mapping any more from the database?

In any case it’s still important for everyone to carry on being polite to people whose mapping approach they disagree with.

Chris Hill said...

Adding a tag that describes something that is clearly not there is not improving the quality. There is no fence between two terraced houses. There may be one between two gardens, but it may be a wall, a hedge or nothing. I would contact the mapper and explain the issue.

I have mapped all the gardens in a village (North Ferriby) and I won't be doing that again in a hurry. I did try to reflect the fence/wall/hedge/nothing barriers from survey and aerials, and I did not continue the barrier through buildings, which I find odd. It seems lazy to me to do so, but anyone mapping terrace layouts and the property boundaries clearly is not being lazy.

Oneway=true/yes may be equivalent, but I would only mass edit a change after *extensive* consultation. Some people may still be using 'true' and they may have a reason for that. Not everyone uses the presets or the same presets in editors and I'm happy about that.