I take your S3 and raise you an EC2

Just read Chris’s post about using Amazon’s S3 as a home for caches. The Amazon service I’ve been contemplating for tiling purposes is actually their Elastic Compute Cloud (EC2). But before we get into it, a bit on S3 and tiles. I’d still like the distributed peer-to-peer tile cache, as I talked about in my post on geodata distribution, but it makes a lot of sense to bootstrap existing services on the way to getting there. S3 could certainly help out as a ‘node of last resort’ – it’s nice to know that the tiles will definitely be available somewhere, if the cache isn’t yet popular enough to be distributed to someone else’s p2p cache on your more local network. I agree that bittorrent and coral aren’t up to snuff, but I do believe that distributing map tiles over p2p will work as the technology evolves. But first we have to get our act together with tiling in the geospatial community, so we can go to the p2p guys with something concrete. Which is why I’m excited about the work being done to figure this out.

As for EC2, I’ve been thinking about it in the context of doing caching with GeoServer. We’ve got some caching working with OpenLayers or Google Maps combined with either OSCache (with a tutorial for GeoServer) or Squid. I want to get it to the point where there’s not even a separate download: you just turn caching on and then hit a big ‘go’ button that walks the whole tile area, caching it all along the way. The problem is that huge datasets can take days, weeks, or even months to fully process. So this is where I think it could be kickass to use EC2 – provide a service where the ‘go’ button links to EC2, which can throw tens, hundreds, or even thousands of servers at churning out the tiles. Then return those tiles to the server GeoServer runs on, or leave them on S3 – indeed this would save on the tile upload costs that Chris writes about, as you’d just send an SLD and the vector data in some nice compressed format. I imagine you could save on upload costs for rasters too, as you’d just upload the non-tiled images and do the tiling with EC2.
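
To make that concrete, here’s a rough Python sketch of what one of those EC2 workers might do: walk its slice of the tile pyramid, request each tile from the WMS, and drop the result on S3. The endpoint, layer, bucket, and the simple global-geodetic tile grid are all placeholder assumptions of mine; the real thing would live inside the caching engine rather than a one-off script.

    # seed_tiles.py - hypothetical EC2 worker: render a slice of the tile
    # pyramid from a WMS and push the results to S3.
    import urllib.parse
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    import boto3  # assumes AWS credentials are configured in the environment

    WMS_URL = "http://example.org/geoserver/wms"   # placeholder endpoint
    LAYER, STYLE = "topp:states", ""               # placeholder layer/style
    BUCKET = "my-tile-cache"                       # placeholder S3 bucket
    s3 = boto3.client("s3")

    def tile_bbox(z, x, y):
        """Bounds of tile (x, y) at level z in a global-geodetic grid
        (two 256x256 tiles at level 0, doubling each level)."""
        size = 180.0 / 2 ** z
        return (-180 + x * size, -90 + y * size,
                -180 + (x + 1) * size, -90 + (y + 1) * size)

    def getmap_url(bbox):
        # standard WMS 1.1.1 GetMap key-value parameters
        params = {
            "service": "WMS", "version": "1.1.1", "request": "GetMap",
            "layers": LAYER, "styles": STYLE, "srs": "EPSG:4326",
            "bbox": ",".join(str(c) for c in bbox),
            "width": "256", "height": "256", "format": "image/png",
        }
        return WMS_URL + "?" + urllib.parse.urlencode(params)

    def seed_tile(task):
        z, x, y = task
        png = urllib.request.urlopen(getmap_url(tile_bbox(z, x, y))).read()
        s3.put_object(Bucket=BUCKET, Key=f"{z}/{x}/{y}.png",
                      Body=png, ContentType="image/png")

    def all_tiles(max_zoom):
        for z in range(max_zoom + 1):
            for x in range(2 ** (z + 1)):
                for y in range(2 ** z):
                    yield (z, x, y)

    if __name__ == "__main__":
        # in the real setup each worker would only get its share of the pyramid
        with ThreadPoolExecutor(max_workers=8) as pool:
            pool.map(seed_tile, all_tiles(6))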

A next step for this tiling stuff would be to make a caching engine that can both pre-walk a tile set and expire a spatial area of one. The caching engine should store the cache according to the tile map service specification, and with those additional smarts it could be uploaded onto EC2 along with the tile creation software (GeoServer or MapServer) to pre-walk the tiles, iterating through all the possible requests. It could then also listen to a WFS-Transactional server that operates against the data used to generate the tiles in the first place. If a transaction takes place against a feature in a certain area, that part of the cache would be expired and could be regenerated either automatically or lazily (either send all the expired requests to the server right away, or wait until a user comes along and checks out that area again).
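
The expiry half could be as simple as mapping a transaction’s bounding box to tile ranges at each zoom level and throwing those tiles away. A minimal sketch, assuming the same global-geodetic grid as above and a z/x/y.png directory layout on disk (both of which are my assumptions, not anything mandated by the spec):

    # expire_tiles.py - sketch of spatial expiry for a cache laid out on
    # disk as cache_dir/z/x/y.png, using the same global-geodetic grid
    # as the seeding sketch above.
    import math
    import os

    def tile_range(bbox, z):
        """Tiles at level z that intersect bbox = (minx, miny, maxx, maxy)."""
        size = 180.0 / 2 ** z
        xmin = int(math.floor((bbox[0] + 180) / size))
        ymin = int(math.floor((bbox[1] + 90) / size))
        xmax = int(math.floor((bbox[2] + 180) / size))
        ymax = int(math.floor((bbox[3] + 90) / size))
        return xmin, ymin, xmax, ymax

    def expire(cache_dir, bbox, max_zoom):
        """Delete every cached tile touching bbox. A lazy cache just lets
        the next request regenerate them; an eager one re-requests now."""
        expired = []
        for z in range(max_zoom + 1):
            xmin, ymin, xmax, ymax = tile_range(bbox, z)
            for x in range(xmin, xmax + 1):
                for y in range(ymin, ymax + 1):
                    path = os.path.join(cache_dir, str(z), str(x), f"{y}.png")
                    if os.path.exists(path):
                        os.remove(path)
                        expired.append((z, x, y))
        return expired  # feed these back to the WMS to regenerate eagerly

    # the WFS-T listener would call something like:
    # expire("/var/tiles/topp_states", transaction_bbox, max_zoom=12)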

I like Paul’s WebMapServer href attribute in the tile map service spec, but I wonder if it’s sufficient… It might be nice if it carried enough information to formulate the GetMap request needed to replicate a given tile map service repository. I’m thinking the name of the layer and the ‘style’ (a named style or a link to an SLD); maybe I’m missing something, but all the other information seems to be there. With that information, a smart tile map service client could perhaps look at multiple repositories and realize that they were generated from the same base WMS service in the same way, and then swarm from several of them simultaneously. This starts to hint at the way forward for p2p distribution – for each WMS service, just keep a global index of where tile map service repositories live and let clients figure out which is fastest or hit all of them at once, potentially including other clients. A catalog that has metadata plus information on where to get even faster tiles would definitely be popular – especially if registering there automatically put a caching tile map service in front of your WMS. You could also register, say, the feed of latest changes (or even just the bounding boxes of the latest changes) of the WFS-T that people use to update the WMS, and smart clients could just listen and expire the tiles in a given area when they get a notification from the feed.
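
For what I’m picturing, the href plus the layer and style would be enough both to rebuild the underlying GetMap request and to tell whether two repositories are mirrors of each other. A rough sketch of that idea (the fields beyond the href are the additions I’m proposing, not anything in the current spec):

    # Given the WebMapServer href plus a layer and style, reconstruct the
    # GetMap request behind a tile and derive an identity key for the
    # repository, so a client can spot repositories rendered the same way.
    import urllib.parse

    def repository_identity(wms_href, layer, style, srs="EPSG:4326"):
        """Two repositories with the same identity should hold identical tiles."""
        return (wms_href.rstrip("?&"), layer, style or "", srs)

    def getmap_for_tile(identity, bbox, width=256, height=256, fmt="image/png"):
        wms_href, layer, style, srs = identity
        params = {
            "service": "WMS", "version": "1.1.1", "request": "GetMap",
            "layers": layer, "styles": style, "srs": srs,
            "bbox": ",".join(str(c) for c in bbox),
            "width": str(width), "height": str(height), "format": fmt,
        }
        return wms_href + "?" + urllib.parse.urlencode(params)

    # A client could group repositories by repository_identity() and then
    # request z/x/y.png from whichever mirror answers fastest - or from
    # several at once - knowing the tiles were rendered the same way.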

I work for a dot-org

So back in May I wrote a post that touched on the need for a name for the kinds of hybrid organizations that don’t fit nicely into the non-profit vs. for-profit binary view of the world. The best we came up with was ‘non-corporation’, which still suffers from the problem of being defined by what it’s not, rather than by what it is. Since then I’ve heard ‘for-benefit’, which I liked a bit more, but am not in love with. And when introducing TOPP I generally just say ‘high tech non-profit’. But I think I’ve finally come upon the name to use, and it was sitting right under my nose all along.

‘I work for a dot-org’.

Try it out and let me know what you think. It’s obviously a play on ‘dot-com’, which has been well established as something other than working for a big corporation (even though many dot-coms have since become big corporations). It is not a narrow definition, which I like, as I think it’s far too soon to define what a ‘dot-org’ is and what isn’t one. This parallels the ‘dot-com’, which seemed to be any company that was doing internet stuff. I like that it softly emphasizes a high tech nature, but the only real criterion is that the organization classifies itself online with a .org top level domain name. I do think, though, that non-profits doing the traditional non-profit thing should not be considered ‘dot-orgs’, just like Citibank didn’t become a ‘dot-com’ when it put up a site at citibank.com.

Ok, so now that we’ve got a name, the next thing to do is to spread the meme. We probably need a nice manifesto, or at least some concrete definition of what a dot-org is, even if it’s something broadly inclusive. Then spread it widely, get the organizations that would be obvious dot-orgs to start identifying themselves as such, and get our friends in the media to start writing up stories. I think success will be when a kid graduating from college can tell her parents that she’s going to work for a dot-org, and have them not only know what that is, but be psyched that their daughter is going to be doing something good for the world and will be able to pay off her student loans before she’s 40. Or at least that will be the first success; the final success will be when the standard way to set up a new venture is something more just and better structured to do good than the corporations running rampant today.

Proprietary vs. FOSS in the Geospatial Web

Thanks for the prod, Chris. An ideal world that brings open source collaboration to geospatial data does raise the question of what the software will look like. I strongly suspect that the core components of an architecture of participation for geospatial information would need to be open source (see my post on holarchies of participation), but I think the edges will likely be proprietary. So the core collaboration server components will be open source, and the easy-to-use software pieces that aren’t whole-hog GIS will be open source, but there will still be proprietary desktop GIS systems that just integrate with the collaboration components. There is a lot of advanced functionality that it will just not make sense for the open source community to tackle.

Weber provides a good lens to examine the implications of open source taken further:

The notion of ‘open-sourcing’ as a strategic organizational decision can be seen as an efficiency choice around distributed innovation, just as ‘outsourcing’ was an efficiency choice around transaction costs.

The simple logic of open sourcing would be a choice to pursue ad hoc distributed development of solutions for a problem that exists within an organization, is likely to exist elsewhere as well, and is not the key source of competitive advantage or differentiation for the organization.

So the pieces of the stack that aren’t a source of competitive advantage for anyone will be those most likely to be open sourced. We see this with Frank Warmerdam’s GDAL library, which the All Points blog reports is included in ESRI’s crown jewel software. Why would the most proprietary software of the GIS world start using open source software? Because the task of reading a variety of different formats isn’t a competitive advantage for them, so it makes more sense to cooperate than compete. How will this play out in the longer run? Data formats make the most sense, and along the same lines are projection libraries. The next step I see past that is basic user interfaces.

This is starting to happen with new pluggable GIS systems like uDig. I see it as quite likely that such toolkits, which handle the reading and writing of formats and the basic UIs, will have proprietary functionality built on top of them relatively soon. There will continue to be innovations in GIS analysis, new operations to be performed on data, better automatic extraction from vectors, etc., as well as innovations in visualization and more compelling user interfaces. These will be sold as proprietary software that integrates with the open source systems. The cool thing about this is that it lowers the barrier to entry for new innovations in GIS, since a new company won’t have to write a full GIS system and won’t have to be dependent on a single company (like the current ArcGIS component sellers, who are hosed if ESRI decides to replicate their functionality). And you will likely still have proprietary databases for advanced functionality – Oracle has great topology and versioning support that is not yet there in PostGIS. PostGIS will catch up in a couple of years, but by that time Oracle should have even more advanced functionality.

Another place we might see proprietary at the edges is in open standards. We’ll likely see the basic standards – WMS, WFS, WCS – mostly fulfilled by open source, while proprietary software does the more interesting analysis, the real web service chaining thing. Just as you’ll have proprietary plug-ins to uDig, so too will there be plug-ins for the Web Processing Service specification. One will be able to take an open source WFS or WCS and pass it to a proprietary WPS for some special processing (generalization, feature extraction, etc.), displaying the results on an open source WMS. I also suspect geospatial search will be best done by proprietary services, as is the trend in the wider web world. Of course Google and Yahoo run open source software extensively, but they keep their core search logic private. So geospatial web services that require massive processing power will likely keep their core logic proprietary, but base it on open source software. This again follows Weber’s point – the basic functionality isn’t a core differentiator, so there will be collaboration on basic functionality (returning WMS and WFS of processed data or search results, for example) and proprietary innovation on the edges (more advanced processing algorithms on huge clusters of computers).
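
To sketch the shape of that chain: pull features from an open source WFS, hand them to the proprietary processing step, and get back something any WMS or tile cache can render. Every endpoint and process name below is made up, and a real WPS would take a proper OGC Execute request rather than a bare POST; this is just the outline of the plumbing.

    # Hypothetical chain: open source WFS -> proprietary processing -> WMS.
    # Endpoints and names are illustrative only.
    import urllib.parse
    import urllib.request

    WFS = "http://data.example.org/geoserver/wfs"          # open source data source
    PROCESS = "http://processing.example.com/generalize"   # hypothetical WPS-like service

    def wfs_getfeature(typename, bbox):
        # standard WFS 1.0.0 GetFeature key-value parameters
        params = {
            "service": "WFS", "version": "1.0.0", "request": "GetFeature",
            "typename": typename, "bbox": ",".join(str(c) for c in bbox),
        }
        return WFS + "?" + urllib.parse.urlencode(params)

    # pull the raw features, hand them to the (proprietary) processing step,
    # and get back GML that an open source WMS or tile cache can style and serve
    gml = urllib.request.urlopen(
        wfs_getfeature("topp:roads", (-74.1, 40.6, -73.7, 40.9))).read()
    processed = urllib.request.urlopen(
        urllib.request.Request(PROCESS, data=gml,
                               headers={"Content-Type": "application/xml"})).read()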

In short, proprietary software will continue to exist; it just won’t play the central role. It will be forced to push the edges of innovation even more to stay afloat, but I suspect it will always be at a leading edge. Of course I believe open source will innovate as well, especially in this geospatial collaboration area. But the ideal is a hybrid world with the right balance of cooperation and competition, pushing things forward faster than either could alone.