Distribution of Geodata

A big problem with geographic data is distribution. Geographic data is huge: in GML, the roads of Los Angeles alone run to 300 megabytes, and the Google street map for just the US is terabytes of data. We have compelling 3D mapping applications like Google Earth and WorldWind, and indeed a very high degree of responsiveness is now expected even of flat web maps, thanks to Google Maps. Traditionally one would just use a desktop GIS to display new geographic information, but increasingly we want to be able to view layers directly online.

Unfortunately these applications place incredibly high demands on existing interoperable servers. The tradition has been to dynamically generate a new map for every request: the server receives a request, goes to its database, and draws the image from scratch. I believe the JRC in Italy banned WorldWind from hitting their server because the load was too much, and the bandwidth was costing them.
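To make the cost concrete: under this model, every pan or zoom issues a fresh WMS GetMap request with a new bounding box, and the server renders a brand-new image for it. A sketch of such a request (the server URL and layer name are placeholders):

```python
from urllib.parse import urlencode

# Hypothetical WMS endpoint and layer, for illustration only.
WMS_URL = "http://example.com/geoserver/wms"

params = {
    "SERVICE": "WMS",
    "VERSION": "1.1.1",
    "REQUEST": "GetMap",
    "LAYERS": "roads",                   # placeholder layer name
    "SRS": "EPSG:4326",
    "BBOX": "-118.5,33.7,-118.1,34.1",   # changes on every pan or zoom
    "WIDTH": "512",
    "HEIGHT": "512",
    "FORMAT": "image/png",
}
print(WMS_URL + "?" + urlencode(params))  # a fresh render per request
```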

The first step is simple, and we’re hoping to get it going in GeoServer very soon; it is already being done in clients like ka-map (though I wish it worked with all WMS servers, not just MapServer) and OpenLayers (which I love). You divide the world up into tiles, just like Google Maps, and cache the results on the server. That way the server has a saved version of each tile and can just return it rather than generate it dynamically. What is needed beyond current practice is a standard, so that clients tile up the earth the same way and make the same requests. Then a tile one client has already requested is available to any other client looking at the same area, as long as the server is caching.

Even with just an agreed way of dividing up the earth, the plain WMS protocol plus Squid gets you noticeably better performance. See our initial experiment (we’re working on better styling): the initial views and zoom levels are fast, as they are cached by Squid. As you move to other areas of the map it may slow down a bit, but when someone else looks at those areas later they will be cached, generating no additional load on the dynamic server. This is only a first step: it eases the processing load on the server and makes things faster, but unfortunately it doesn’t help with the bandwidth costs of standing up a popular geospatial server.
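For this to work, every client has to compute the same tile indices for the same spot on the earth. As a sketch of what such a convention might look like, here is the lon/lat-to-tile mapping for the spherical-Mercator quadtree scheme that Google Maps popularized (one plausible candidate, not an agreed standard):

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Map a WGS84 lon/lat to (x, y) tile indices at a given zoom,
    using the spherical-Mercator quadtree popularized by Google Maps.
    At zoom z the earth is a 2**z by 2**z grid of tiles."""
    n = 2 ** zoom  # tiles per axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Two clients asking about downtown Los Angeles at zoom 12 compute
# the same indices, so one cached tile serves them both.
print(lonlat_to_tile(-118.24, 34.05, 12))   # -> (702, 1635)
```

At zoom 0 the whole earth is a single tile, and each zoom level quadruples the tile count, so (zoom, x, y) uniquely names any tile and makes a natural cache key for Squid or the server.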

This is where an architecture of participation can come into play. Peer-to-peer technology has been evolving significantly, and it could likely handle the same tiles that a server would cache: instead of asking the server for the set of tiles covering an area, a client could ask the P2P network. The problem that may initially look hard is how the tiles get onto the P2P network in the first place. This could actually be incredibly easy if it were built into thick 3D clients like Google Earth and WorldWind. These clients are already caching gigabytes of data, in tiles; we just need to make sure the tiles follow a standard for how they divide up the earth. The P2P layer could be built into the clients and automatically start up and start sharing (though it may need some configuration help to get past firewalls). Hopefully you’d also make it easy for people with extra server resources to help out: they could install a simple program, give it a certain amount of hard drive space, and it would fill up with tiles for users, a voluntary supernode of sorts.
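For peers to find each other’s tiles, they also need an agreed-upon identifier per tile. A minimal sketch, assuming a DHT-style network where content is looked up by key (the naming convention here is invented for illustration):

```python
import hashlib

def tile_key(layer, zoom, x, y):
    """Derive a stable content key for one tile. Any scheme would do,
    as long as every client tiles the earth the same way and hashes
    the same canonical name; this particular format is hypothetical."""
    canonical = f"{layer}|{zoom}/{x}/{y}".encode("utf-8")
    return hashlib.sha1(canonical).hexdigest()

# Every Google Earth or WorldWind instance holding this tile would
# announce the same key, so any peer can locate and fetch it.
print(tile_key("http://example.com/wms#roads", 12, 702, 1635))
```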

We could actually likely do one better and allow thin clients like Google Maps, OpenLayers, and ka-map to talk to a P2P network as well. There are P2P clients that are web servers in their own right, so one could download and install a P2P client and have it act as a ‘map accelerator’ for the lightweight services. The web-based client would ask the local P2P program whether it had the requested tiles, and it in turn would ask the whole P2P network. Any tiles that weren’t there it could fetch from the real server (which is likely caching them anyway), and it would then insert those tiles into the P2P network for use by other clients. Ideally this would work with a P2P client users already have, or they could download a generic P2P client rebranded as a ‘map accelerator’. This would also give web maps more persistent caching: if you revisited an area, you’d have a fast local copy, whereas currently you only get local caching within each browser session. And those on your LAN would also greatly benefit from your caching.
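A minimal sketch of that request flow, with a local HTTP server standing in for the P2P client (fetch_from_peers, announce_to_peers, and the origin URL are hypothetical placeholders):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.request

ORIGIN = "http://example.com/tiles"   # hypothetical real tile server

local_cache = {}                       # request path -> tile bytes

def fetch_from_peers(path):
    """Placeholder: look the tile up on the P2P network; None on miss."""
    return None

def announce_to_peers(path, tile):
    """Placeholder: offer a freshly fetched tile to other peers."""

class MapAccelerator(BaseHTTPRequestHandler):
    def do_GET(self):
        # Try, in order: the local cache, the P2P network, the origin.
        tile = local_cache.get(self.path) or fetch_from_peers(self.path)
        if tile is None:
            with urllib.request.urlopen(ORIGIN + self.path) as resp:
                tile = resp.read()
            announce_to_peers(self.path, tile)  # share it onward
        local_cache[self.path] = tile
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        self.end_headers()
        self.wfile.write(tile)

if __name__ == "__main__":
    # The web map client is pointed at http://localhost:8080 instead
    # of the origin server.
    HTTPServer(("localhost", 8080), MapAccelerator).serve_forever()
```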

As for implementations, Schuyler presented many of these ideas at the last OSGeo conference (see his presentation (PowerPoint warning)), and had an interesting thought for actually building it: use a distributed hash table keyed on location, under the assumption that those who are physically near you are more likely to be looking at the same tiles as you, since people tend to look most at maps of the area they are in. Another idea I hadn’t thought of, which John Graham brought up, is ‘GeoTorrents’. I had always thought of torrents as useful for distributing large datasets efficiently, but assumed you’d have to get the whole dataset. The idea John pointed to is that since BitTorrent already divides large files into much smaller pieces, you could bootstrap on that: just make the way it divides up a large map image line up with the pre-set tiles. I’ve never looked extensively at how BitTorrent works, but if a peer is allowed to request only a small portion of the whole, this could be an ideal solution. Many P2P clients don’t seem well optimized for small files, but most do split large files up and ‘swarm’ them in a bunch of parts. So we could consider the full set of tiles the large file, and individual tiles the parts that make it up. The difference for us is that being able to view a coherent subset of the parts is probably more useful than having the whole. But the point is that there should be some way to leverage the P2P concept so that those who have already viewed an area of the map can serve it to those who haven’t, instead of everyone swamping a central service.
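To show how the pre-set tile grid could map onto a swarm’s numbered pieces, here is one illustrative scheme (my own sketch, not how any existing client works): interleave the bits of a tile’s x and y into a Z-order index, which has the nice side effect of giving map-adjacent tiles nearby piece numbers, echoing Schuyler’s locality idea.

```python
def tile_to_piece(x, y, zoom):
    """Interleave the bits of tile x and y into a Z-order (Morton)
    index within the zoom level, so each tile gets a stable piece
    number and neighbouring tiles get clustered piece numbers."""
    piece = 0
    for bit in range(zoom):
        piece |= ((x >> bit) & 1) << (2 * bit)
        piece |= ((y >> bit) & 1) << (2 * bit + 1)
    return piece

# Four neighbouring tiles at zoom 12: peers browsing one
# neighbourhood end up holding contiguous pieces of the 'file'.
for tx, ty in [(702, 1635), (703, 1635), (702, 1636), (703, 1636)]:
    print((tx, ty), "->", tile_to_piece(tx, ty, 12))
```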

The alternative to going this route is just letting the big guys dominate the mapping world. Not only are they the only ones who can afford the data, they are the only ones who can stand up a service that handles millions of users. The damn brilliant thing about using an architecture of participation for geospatial data is that a layer scales perfectly as it gets more popular, since more people downloading and checking out the layer means more people serving it up to others. So even the smallest provider can afford to stand up a server and not have it slow to a crawl if it suddenly gets popular.


2 thoughts on “Distribution of Geodata”

  1. Hi Chris,
    I stumbled onto your blog a couple of days back and have really enjoyed reading through your posts.
    Two of your recent topics are ideas I’ve been churning around in my head for a while (collaborative geodata and p2p geodata distribution).
    p2p geodata distribution built into client software, I think, has the potential to lead to a more open approach from data owners by allowing a very simple method of geodata distribution.

    My background is in spatial analysis / environmental modelling; as a result I can see p2p applications for the distribution of datasets (WCS and WFS for example), with WMS improvements being an added bonus. There is a lot more power in overlaying the data than the map.

    Before I go and try to invent something on my own, do you know of anyone who is making any significant attempt at making any aspect of p2p geodata distribution a reality?

    In my searching I haven’t found many significant attempts, except for the following few:
    - There is the very basic GeoTorrent: http://www.geotorrent.org/
    - An attempt to build p2p into an early version of WorldWind
    - J Bergamini’s masters project (strangely also called GeoTorrent),
      here (I think you even commented on this page):
      http://thread.gmane.org/gmane.comp.gis.udig.devel/2387/focus=2439
      and
      http://www.ucgis.org/summer2006/studentpapers/bergamini-ucgis-final.pdf

    Have I missed any?
    Thanks for the great blog.
    Art

  2. Pingback: I take your S3 and raise you an EC2 « Into The Pudding
