I take your S3 and raise you an EC2

Just read Chris’s post about using Amazon’s S3 as a home for caches. The Amazon service I’ve been contemplating for tiling purposes is their Elastic Compute Cloud (EC2). But before we get into it, a bit on S3 and tiles. I’d still like the distributed peer-to-peer tile cache that I talked about in my post on geodata distribution. But it makes a lot of sense to bootstrap on existing services on the way there. S3 could certainly help out as a ‘node of last resort’: it’s nice to know that the tiles will definitely be available somewhere if the cache isn’t yet popular enough to have been distributed to someone else’s p2p cache on your more local network. I agree that BitTorrent and Coral aren’t up to snuff, but I do believe that distributing mapping tiles will work as p2p technology evolves. But first we have to get our act together with tiling in the geospatial community, so we can go to the p2p guys with something concrete. Which is why I’m excited about the work being done to figure this out.

As for EC2, I’ve been thinking about it in the context of doing caching with GeoServer. We’ve got some caching working with OpenLayers or Google Maps combined with either OSCache (with a tutorial for GeoServer) or Squid. I want to get it to the point where there’s not even a separate download: you just turn caching on, and then have a big ‘go’ button that walks the whole tile area, caching it all on the way. The problem is that huge datasets can take days, weeks or even months to fully process. So this is where I think it could be kickass to use EC2: provide a service where people’s ‘go’ button links to EC2, which can throw tens, hundreds, or even thousands of servers at churning out the tiles. Then return those to the server GeoServer’s on, or leave them on S3. Indeed this would save on the tile upload costs that Chris writes about, as you’d just send an SLD and the vector data in some nice compressed format. I imagine you could save on upload costs for rasters too, as you’d just upload the non-tiled images and do the tiling with EC2.
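To put rough numbers on the ‘days, weeks or even months’ problem, here is a back-of-the-envelope sketch. It assumes a plain quadtree pyramid (one tile at level 0, four times as many at each level below) and a made-up render rate; neither figure comes from GeoServer or EC2, and the function names are purely illustrative.

```python
def total_tiles(max_zoom: int) -> int:
    """Tiles in a full quadtree pyramid from level 0 through max_zoom.

    Assumption: level z holds 4**z tiles. Real tiling schemes may start
    with more than one tile at the top level.
    """
    return sum(4 ** z for z in range(max_zoom + 1))


def render_days(max_zoom: int, tiles_per_second: float, servers: int = 1) -> float:
    """Days needed to pre-walk the whole pyramid.

    Assumption: rendering is perfectly parallel, so throughput scales
    linearly with the number of servers.
    """
    seconds = total_tiles(max_zoom) / (tiles_per_second * servers)
    return seconds / 86400


if __name__ == "__main__":
    # Hypothetical workload: 16 levels at 10 tiles per second per server.
    for servers in (1, 100, 1000):
        days = render_days(15, tiles_per_second=10, servers=servers)
        print(f"{servers:>5} server(s): {days:,.1f} days")
```

The tile count grows by a factor of four per level, so the deepest levels dominate the cost. Since each tile can be rendered independently, this is the kind of job that splits cleanly across many machines.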

A next step for this tiling stuff would be to make a caching engine that can both pre-walk a tile set and expire a spatial area of a tile set. The caching engine should store the cache according to the tile map service specification. With those additional smarts, the engine could be uploaded onto EC2 along with the tile creation software (GeoServer or MapServer) and just pre-walk the tiles, iterating through all the possible requests. It could also listen to a WFS-Transactional server that operates against the data used to generate the tiles in the first place. If a transaction takes place against a feature in a certain area, then that part of the cache would be expired, and could be regenerated either automatically or lazily (either send all the expired requests to the server right away, or wait until a user comes along and checks out that area again).
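The ‘expire a spatial area’ part could look roughly like the sketch below. It assumes a geodetic grid covering -180,-90 to 180,90 with two 180-degree tiles at level 0 and rows counted from the bottom, and it uses a plain dict as the cache. The grid layout, the `expire` function and the key format are illustrative assumptions, not the API of any existing caching engine.

```python
import math

# Assumed world extent: minx, miny, maxx, maxy in degrees.
WORLD = (-180.0, -90.0, 180.0, 90.0)


def tiles_for_bbox(bbox, zoom):
    """Return the set of (zoom, x, y) tiles that touch bbox at one level.

    Assumption: level `zoom` has 2**(zoom + 1) columns and 2**zoom rows,
    each tile spanning 180 / 2**zoom degrees, with y counted from the bottom.
    """
    minx, miny, maxx, maxy = bbox
    size = 180.0 / (2 ** zoom)
    cols, rows = 2 ** (zoom + 1), 2 ** zoom
    # Clamp to the grid so a bbox on the world edge stays in range.
    x0 = max(0, math.floor((minx - WORLD[0]) / size))
    y0 = max(0, math.floor((miny - WORLD[1]) / size))
    x1 = min(cols - 1, math.floor((maxx - WORLD[0]) / size))
    y1 = min(rows - 1, math.floor((maxy - WORLD[1]) / size))
    return {(zoom, x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}


def expire(cache, bbox, max_zoom):
    """Drop every cached tile touching bbox, at all levels up to max_zoom.

    Returns the keys that were actually removed. An eager engine would
    re-request these right away; a lazy one would ignore the list and let
    the next visitor trigger regeneration.
    """
    expired = []
    for zoom in range(max_zoom + 1):
        for key in sorted(tiles_for_bbox(bbox, zoom)):
            if cache.pop(key, None) is not None:
                expired.append(key)
    return expired


if __name__ == "__main__":
    # Hypothetical cache contents: tile key -> image bytes.
    cache = {(0, 0, 0): b"west", (0, 1, 0): b"east", (1, 2, 1): b"detail"}
    # Pretend a WFS-T transaction touched a feature inside this box.
    print(expire(cache, (1.0, 1.0, 2.0, 2.0), max_zoom=1))
    print(sorted(cache))
```

The bounding box of a transaction is all the engine needs, so the cache never has to understand the features themselves.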

I like Paul’s WebMapServer href attribute in the tile map service spec, but I wonder if it’s sufficient… It might be nice if it had enough information for one to formulate a GetMap request that replicates a given tile map service repository. I’m thinking of the name of the layer and the ‘style’ (a named style or a link to an SLD). Maybe I’m missing something, but all the other information seems to be there. With that information, a smart tiling map service client could look at multiple repositories and realize that they were generated from the same base WMS service in the same way. Then it could swarm from several of them simultaneously. This starts to hint at the way forward for p2p distribution: for each WMS service, just keep a global index of where tiling map server repositories live, and let clients figure out which is fastest or hit all of them at once, potentially including other clients. A catalog that has metadata plus information on where to get even faster tiles would definitely be popular, especially if registering there automatically put a caching tile map service in front of your WMS. You could also register, say, the feed of latest changes (or even just the bounding boxes of latest changes) of the WFS-T that people use to update the WMS, and smart clients could just listen and expire the tiles in a given area when they get notification from the feed.
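As a rough illustration of why the layer name and style matter, here is a sketch of turning a tile address into a WMS 1.1.1 GetMap URL. The grid (two 180-degree tiles at level 0, rows counted from the bottom, EPSG:4326), the 256 pixel tile size and the example.com endpoint are all assumptions made for the example; the layer and style arguments are exactly the pieces the href alone does not supply.

```python
from urllib.parse import urlencode


def tile_bbox(zoom, x, y):
    """Bounding box of one tile in the assumed geodetic grid.

    Assumption: tiles span 180 / 2**zoom degrees, with x counted from
    -180 and y counted from -90 (bottom-left origin).
    """
    size = 180.0 / (2 ** zoom)
    minx = -180.0 + x * size
    miny = -90.0 + y * size
    return (minx, miny, minx + size, miny + size)


def getmap_url(wms_href, layer, style, zoom, x, y, tile_px=256, fmt="image/png"):
    """Build the GetMap request that would reproduce a single tile.

    `layer` and `style` are the extra pieces of information discussed
    above; everything else follows from the tile address and the grid.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": style,
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in tile_bbox(zoom, x, y)),
        "WIDTH": tile_px,
        "HEIGHT": tile_px,
        "FORMAT": fmt,
    }
    separator = "&" if "?" in wms_href else "?"
    return wms_href + separator + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint, layer and style names.
    print(getmap_url("http://example.com/wms", "roads", "default", 0, 1, 0))
```

Two repositories that advertise the same href, layer and style would produce identical requests for the same tile address, which is what would let a client treat them as interchangeable sources. A style given as a link to an SLD would presumably travel in a separate parameter rather than STYLES; that case is left out here.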


17 thoughts on “I take your S3 and raise you an EC2”

  1. I equal that bet Chris because using EC2 will ultimately use S3 as a storage mechanism anyhow~

    P2P certainly still has its merits for distribution, but I don’t see that model succeeding if the dataset isn’t actively available in the first place. Kinda like the critical mass required to make BitTorrent succeed …

    I much prefer EC2 + S3 + tile WMS href. Lazy regeneration of tiles is a great idea imo

  2. Yeah, that’s what I was getting at, that S3 would be the ‘node of last resort’, a place the cache would for sure be available, and then you can build the p2p distribution around that backbone. Indeed just having the initial WMS on one’s own server plus the lazy loading caching engine implementation of the tile map service on EC2 storing on S3 could be a cool thing, as the original WMS should only get hit once per tile. The S3 cache becomes the next cache up, and then you can grow the p2p infrastructure on top of that. After that’s an accepted practice you could cut out S3, but I agree, we’re going to need a half way point and S3 will get people used to another computer storing their tile cache, and hopefully flush out the problems of that before doing full on distribution of partial caches at various peers.

  3. I originally thought ChrisT was a bit cracked with the S3 suggestion, but some investigation of S3 reveals it to be a nearly perfect fit for the tile map specification. You could upload a static TMS instance and access it directly from S3, with no URL re-mapping intermediary required; it would be 100% spec compliant. Yet more confirmation that a RESTful implementation pays huge dividends.

    One thing I want to check out is whether S3 will respect and propagate cache-control headers. Because you don’t really need a fancy p2p solution if you get good shared caching going on, and since tiles are very static data they are very amenable to caching. Respecting cache-control headers is against S3’s financial interest though, because if data is cached elsewhere it doesn’t need to be downloaded from S3, so they lose data transmission revenue.

  4. You made me chuckle Paul 🙂

    I was planning to do a mockup on S3 a few weeks ago but just don’t have time … if anyone is willing to give it a go, let me know and I may be able to help.

    As far as cache control headers, using S3’s various authentication methods you have explicit control over pretty much everything.

    eg. I was planning to use … mod_proxy/mod_rewrite (as per tile spec) -> php s3 REST authentication -> set cache headers and return image

    My initial tests make this a very attractive option

  5. Pingback: Technical Ramblings » TileCache: Map Tile Caching