Architectures of Participation for metadata.

So if metadata is the problem, what’s the solution?

I believe there are two parts. The first is to automate as much as possible; the second is to involve people as much as possible.

Automation:

The remote sensing community figured out a while ago that the best way to ensure there's always some decent metadata is to embed it in the data itself. In GeoTIFF, JPEG 2000, and others, the headers of the files always say when the image was created and the area it represents. In vector formats there is nothing like this. What needs to happen is for the tools that process the data to automatically annotate what the original sources were, what was done to them, and by whom. Someone is always logged in to a computer; even Office records who the author of a file was. Those who gather and create data should not be saddled with the responsibility of also creating good metadata. Instead, as much as possible should be created automatically at creation time, with them just making sure that their settings are right.
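To make that concrete, here's a minimal sketch (in Python) of what automatic provenance capture could look like inside a data-processing tool. Everything here is illustrative: the sidecar-file convention, the record_provenance helper, and the buffer operation in the example are my assumptions, not any existing tool's API.

```python
# A sketch of automatic provenance capture for a hypothetical vector
# processing tool. The sidecar convention and all names are assumptions.
import getpass
import json
import platform
from datetime import datetime, timezone
from pathlib import Path

def record_provenance(output_path: str, sources: list[str], operation: str) -> None:
    """Write (or append to) a .provenance.json sidecar next to the output."""
    entry = {
        "created": datetime.now(timezone.utc).isoformat(),
        "author": getpass.getuser(),   # whoever is logged in right now
        "host": platform.node(),
        "operation": operation,        # what was done to the sources
        "sources": sources,            # what the output was derived from
    }
    sidecar = Path(output_path).with_suffix(".provenance.json")
    history = json.loads(sidecar.read_text()) if sidecar.exists() else []
    history.append(entry)
    sidecar.write_text(json.dumps(history, indent=2))

# A processing tool would call this as a side effect of doing its work:
record_provenance("roads_buffered.shp", ["roads.shp"], "buffer(distance=50m)")
```

The point is that the user never fills out a form; the tool records the who, when, and what as a side effect of running.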

People:

There is a saying in open source software, which ESR dubbed 'Linus's Law': 'Given enough eyeballs, all bugs are shallow'. My law for geospatial metadata would be 'Given enough eyeballs, good metadata will emerge'. Given a large enough user base, problems with metadata will be obvious, and someone can correct and annotate them. But the prerequisite is that control of the metadata must be in the hands of all, just as it is with open source software. That may not mean everyone necessarily has write access to the metadata (though it might), but rather that there is an architecture of participation around metadata, with feedback loops that let errors found in the metadata feed back into the original.

How will this work? There are many models out there, and I'm not sure exactly what the architecture will look like, but I have a couple of ideas. The follow-up to Linus's Law is that the person who finds the bug is usually not the person who fixes it. Similarly, the person who creates the geospatial data should usually not be the one who creates the metadata. It's the last task they're excited about doing. It's similar to how coders don't like to comment their code, but it's even worse, since there are arcane standards that require 'training'.

So what we need to do is open the door for others to edit the metadata. The most obvious solution is wiki-able metadata, editable by anyone. This is certainly a step in the right direction, but I think we could do better. The next solution could look something like Amazon, with its listings of books. People can write comments about the books, adding additional information and giving subjective opinions. Others can then rate the comments, with the 'was this review helpful to you?' functionality, so that the best comments rise to the top. So too in geospatial metadata: those who have downloaded or browsed the dataset, studied it extensively, know the field, etc. are the most able to comment on the data. Others rate the comments that were helpful to them, and so one can easily see what others thought of it.
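As a thought experiment, here's a toy sketch of that Amazon-style model applied to a dataset's metadata, with comments and 'was this helpful?' votes deciding what rises to the top. All of the class and field names are hypothetical.

```python
# A toy version of Amazon-style comments on a dataset's metadata.
# Every name here is hypothetical; it just illustrates the shape.
from dataclasses import dataclass, field

@dataclass
class Comment:
    author: str
    text: str
    helpful: int = 0       # 'was this review helpful?' yes votes
    not_helpful: int = 0   # ...and no votes

    def score(self) -> float:
        votes = self.helpful + self.not_helpful
        return self.helpful / votes if votes else 0.0

@dataclass
class DatasetMetadata:
    title: str
    comments: list[Comment] = field(default_factory=list)

    def top_comments(self) -> list[Comment]:
        # Best-rated annotations rise to the top, like Amazon reviews.
        return sorted(self.comments, key=lambda c: c.score(), reverse=True)

roads = DatasetMetadata("County road centerlines")
roads.comments.append(Comment("jane", "Attributes are stale east of Rt 9."))
roads.comments[0].helpful += 3
for c in roads.top_comments():
    print(f"[{c.score():.0%} helpful] {c.author}: {c.text}")
```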

Beyond that, I think it'd be interesting to add a social component. For example, you could fill out a profile, and perhaps the domain you're most interested in is ornithology. You would obviously care about the datasets that others in the same domain are interested in, and especially datasets that they rate highly. Ideally you could bring this back to the automation, and have not only the creation of metadata be automated, but also the addition of further metadata. Your desktop GIS or 3D browser would automatically tally what datasets you've looked at, and it would rate more highly the ones that you consistently come back to. Of course you could manually lower the rating; maybe you keep coming back to one because you're masochistic and like using really bad data. You as the user should be able to override the default value, but it's really nice to have the automatically created defaults.
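Again just as a sketch, with the rating scale and the curve as pure assumptions on my part, the implicit-rating idea might look something like this inside a client:

```python
# A sketch of usage-derived default ratings with a manual override.
# The 0-5 scale and the log curve are my assumptions, nothing more.
import math
from collections import Counter

class UsageRatings:
    def __init__(self) -> None:
        self.visits: Counter[str] = Counter()
        self.overrides: dict[str, float] = {}

    def record_visit(self, dataset_id: str) -> None:
        # The client calls this every time you open a dataset.
        self.visits[dataset_id] += 1

    def set_override(self, dataset_id: str, rating: float) -> None:
        # The masochist's escape hatch: the user always wins.
        self.overrides[dataset_id] = rating

    def rating(self, dataset_id: str) -> float:
        if dataset_id in self.overrides:
            return self.overrides[dataset_id]
        # Repeated visits imply usefulness; log2 keeps it on a 0-5 scale.
        return min(5.0, math.log2(1 + self.visits[dataset_id]))

ratings = UsageRatings()
for _ in range(10):
    ratings.record_visit("landuse-2005")
print(ratings.rating("landuse-2005"))   # ~3.46, the automatic default
ratings.set_override("landuse-2005", 1.0)
print(ratings.rating("landuse-2005"))   # 1.0, the manual override
```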

If the clients used to browse and work with data are part of the architecture of participation, the creation of good metadata becomes much easier. The problem then shifts back to machines: being able to process all this metadata that people are generating into something useful. Right now the best we've got in the way of organizing this stuff is maybe the Google Earth Community board or a few of the sites that organize mash-ups. These are the Yahoo!s of the geospatial arena, humans running around trying to organize what's out there. I hope that we will see massive amounts of valuable data that will just beg for an innovative company to come along and help organize it all. The other way to bootstrap could be to just start a neutral catalog where anyone can register anything, including other people's services, and build a layer on top that allows additional comments and ratings.
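A neutral catalog like that could start very simply. This sketch is hypothetical through and through (the names, the first-registration-wins policy), but it shows how thin the base layer could be, with the comments and ratings layered on top:

```python
# A sketch of a neutral catalog: anyone can register any service,
# including someone else's, with comments and ratings layered on top.
# All names and policies here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ServiceEntry:
    url: str
    registered_by: str                 # need not be the service's owner
    description: str = ""
    ratings: list[int] = field(default_factory=list)
    comments: list[str] = field(default_factory=list)

class Catalog:
    def __init__(self) -> None:
        self.entries: dict[str, ServiceEntry] = {}

    def register(self, url: str, who: str, description: str = "") -> None:
        # No gatekeeping: first registration wins, and anyone may do it.
        self.entries.setdefault(url, ServiceEntry(url, who, description))

    def rate(self, url: str, stars: int, comment: str = "") -> None:
        entry = self.entries[url]
        entry.ratings.append(stars)
        if comment:
            entry.comments.append(comment)

catalog = Catalog()
catalog.register("http://example.org/wms", "cholmes", "State hydrography WMS")
catalog.rate("http://example.org/wms", 4, "Fast, but the legend is broken.")
```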

The Metadata problem. Or, the problem with metadata

In the geospatial domain, a big problem that many worry about is 'metadata'. Metadata is the information about the data: who collected it, when it was last updated, how accurate it is, how it was made, who to contact to get it, etc. For many years the FGDC, the coordinating organization for sharing geospatial information in the US, has primarily focused on getting people to write metadata for their datasets, and to put the metadata in catalogs of information so that others know what information is out there.

Unfortunately, though millions of dollars have been spent educating people on metadata standards and how to fill them out, there is still shockingly little metadata, let alone actual data, available. Many believe that metadata has to be the basis of the coming 'geospatial web', that being able to at least search who has what information is the first step towards getting even more data available. If I know who has what information, then at least I can seek them out and offer to pay them for it; at least it exists in some form, or so the argument goes.

The big counter to this argument, however, is the World Wide Web. When you write a web page, how much metadata are you required to fill out? Absolutely none. Yes, there are some meta-tags in HTML, but none are required, and your web page will still be found if there are none. Why isn't this metadata needed? Because a whole industry has been built around helping you search web pages; indeed, to judge by which sites get the most traffic, it's definitely the most important industry on the web. Why did these search engines come about? Because there was data. Lots of it. And people needed help finding it. In the early days it was Yahoo!, which was able to hire a bunch of people to search the web and categorize it. As the web started growing faster than a team of monkeys clicking all over the place could handle, automated techniques began to be used, with Google emerging as the clear winner.

And the web continues to innovate, with blogs that one person can follow for another individual's recommendations of relevant information, with community-rated sites like slashdot and digg, and community tagging on sites like flickr and del.icio.us. Many people are looking to apply such things to geospatial, but what needs to happen first is to put data online.

Unfortunately many of the largest organizations that have data don't put it online. One argument is technical: it costs too much and is too hard to set up a server to get the data out there. I hope that GeoServer, my main focus in the last few years, is able to offer a cost-free, easy-to-use alternative that makes that argument less effective. But I believe there's a deeper issue, mostly related to psychology, with individuals being scared to put their data out there. Why? Because the individuals who produce it fear that what they've made isn't good enough, that it has to be perfect, or people will think less of them. And it gets even worse, since there's this whole metadata pressure that says they'd better have good metadata if they want to put things out there.

I understand the fear well. When my boss first asked me to release my code to the public repository, where anyone could look at it, it freaked me out. I asked him for an extra week, and spent it adding more comments, redoing the quicker hacks for cleaner code, etc. At the end of the week he asked me again, and I still didn't feel ready. What if someone read it and realized I was a bad coder? It might hurt my chances of a future job. It was putting a piece of myself out there for others to judge, and it was very scary. But I eventually got over it, because I realized that even very bad code that I wrote is generally better than the alternative, which is nothing.

In the geospatial domain, for the most part, we get nothing. People are afraid others might find errors, or they don't have the time to fill out the appropriate metadata. And past that, they lack the skills to set up a server, or a good place to just post their data. Though the Freedom of Information Act in the US basically requires most any information produced by the government to be available to all taxpayers, there is still just a tiny percentage of geospatial information available, let alone accessible to an average user.

I think one of the biggest things needed is a shift in thinking. Metadata needs an architecture of participation, and there needs to be a culture of encouragement. Indeed, we need an architecture of participation around geospatial data, so that releasing it isn't opening yourself up to criticism; instead it puts the onus on others to make what you've put out better, or to move on. This is how it works in the Open Source movement: released code is always seen as a good thing, even if it's not what I need. Once the data starts to get out there, I believe it will begin to make economic sense for companies to build search engines and participation-based organization schemes that will organize it. The problem is not a lack of metadata; it's the focus on metadata that's slowing down getting real data out there for real innovation. I'll write more about what I think can help in a future post.

(for a great piece about metadata in general, see: Is it time for a Moratorium on Metadata?)

Building the GeoSpatial Web, Introduction

Ok, it's finally time for me to get to the whole reason I started this blog, which I thought I'd have done in the first month or two. The thing I'd like to examine is how architectures of participation can apply to geospatial. I started a journal paper when I was in Africa, with the help of Mike Gould and Andreas Wytzisk. Unfortunately it was far too speculative for academia, and we've had a hard time turning it into a real journal article, with appropriate references for everything. I learned a lot from their help, but as I have no immediate plans to be an academic, and more just want to see the ideas come into being, I decided it makes sense to just present what I came up with in a blog so it's out there, so the things I thought about during my year in Zambia don't just live in my head.

The subject of the paper was applying 'open source principles' to 'spatial data infrastructures'. I've since taken to terms that I like a bit better for each, which I've been using in this blog. I use 'Architectures of Participation' for 'open source principles', so as to not confuse the issue as much (see my original post, linked above, for more). And I've found the term SDI so overused that it now means too many different things to different people. So I prefer 'GeoSpatial Web', which I believe would achieve the same end result the SDI builders desire: organizing and searching tons of geospatial information the way the World Wide Web does for text, without ever making people fill out metadata and put their data in catalogs.

There are four areas where I feel geospatial could benefit from increased architectures of participation. This should strike people as a lot less radical than it would have a couple years ago, as 'web 2.0' is on the cover of Newsweek and geospatial leads the 'mash-up' charge (see for example http://www.programmableweb.com/apis, which at the time of writing had 30% of mashups written with Google Maps (#1), plus 5% GeoNames and 5% Yahoo Maps). But I believe that something much bigger could happen than a bunch of one-off mashups. I believe there is the potential for a true 'GeoSpatial Web', where geodata is so available and so rich that it becomes a texture, a foundation on top of which new services can be made. Just as the World Wide Web enables these geospatial mashups (one architecture of participation begets many more), so too will the geospatial web ultimately enable new layers of openness and collaboration. The things that jump out to me are in urban planning: open source traffic modeling, and true citizen participation in planning, with people able to adjust simulation variables online, walk by an empty lot and learn what's going to be built there next, and use their mobile phones to leave feedback on the plan. But I imagine it will mean different things to different domains, just as the WWW has enabled connections that no one could anticipate. The point is we won't just have mash-ups; instead we'll have mashups that talk to each other, that live as information in their own right, to be re-used and re-mashed, not merely the end combination of commercial APIs as they tend towards now.

The four areas where I initially see architectures of participation being applied are (in no particular order):

  • Metadata – creation and updating
  • Geospatial data – generation and maintenance
  • Software for geospatial (open source)
  • Distribution of data

If these four things come into place, we will start to have a holarchy of participation around geospatial data. I believe we will no longer talk about building 'spatial data infrastructures'; things will just come together in an integrated web of information. It will no longer be a matter of paying people to learn how to fill out metadata and put their data online. Instead, citizens will ask the question 'why isn't this data available?', just as we now wonder how companies and governments can not have a web presence. The SDI builders should focus on enabling bottom-up participation, innovating to enable a new kind of infrastructure, instead of relying on past tech metaphors.

Over the next bunch of posts I hope to explore in depth what can be done in the geospatial domain to bring more participation. I believe that if done right it will enable what we dream of, and more.

Open Source Lego Metaphor

So it's been too long since I've posted anything. Thought I'd throw up a piece that I've used in a couple papers, which can be a nice way for me to explain open source, so that I can reference it in future posts.

The term Open Source (OS) refers to a set of licenses that require unfettered access to the human-readable source code from which all computer programs are made. There are many explanations of Open Source; this is an attempt to highlight the collaborative spirit of the Open Source process. A helpful metaphor is to think of source code as a number of LEGO® pieces. This is not far from reality, as source code is not just a long stream of text; instead it is a number of small files that work together and build something more complex than the sum of their individual parts. One can imagine software as a very complex LEGO house, a functional unit built up from individual pieces that is used by consumers. Most commercial software is sold already built into its final form, similar to an already built house (or a car, a firetruck, or a school). Unlike LEGO sets, which include a detailed instruction booklet, proprietary software does not include instructions for its inner workings. This is satisfactory for most people, since they just want a house for their LEGO people to live in. But it's a travesty for anyone who plays with LEGOs, or who might want to modify their house after they buy it, or use the parts to build an entirely new type of structure. Open Source software requires that the instructions to build the house are always included. This may seem like a very minor point, but for people who build with LEGOs it is quite important.

A key property of all software is that it can be infinitely copied. In economics terms this is called a non-rival good: one which can be used by two people with no loss to one another. This makes software a good type of thing to share, since no one loses anything when someone else is using it. This property is what has enabled open source software, since it costs me close to nothing for you to use the same code that I did.

For a long time Open Source software was something that no one except the LEGO builders cared about. They formed communities and built all sorts of LEGO structures, freely sharing them with one another. They built a culture around sharing, and formed governing structures to organize themselves to build even larger houses, which anyone could then freely copy and use. If someone built a nice extra room on their house, then they could contribute it to the large house that everyone was working on, and all would then gain the benefits of his innovation. As more people started copying and building better houses, the builders came up with even more innovative ways to organize and coordinate everyone's efforts. But at the root was the right that I am allowed to walk into someone's (digital) house, and if there's a light fixture that I like, I'm allowed to rip it off the wall and put it in my house. Which isn't a big deal, since ripping it off their wall doesn't actually take it from them; I just take a copy out of the wall.

Soon the world started noticing that they were building some really nice houses. Not always the best looking houses, but generally some of the strongest. The Open Source community stipulated that these houses should be freely copyable, modifiable, and able to be used as pieces of even larger buildings. Commercial software, on the other hand, continues to not only hide the instructions, but is also usually bound by a license that prevents people from copying the LEGO house they bought. Open Source communities, however, establish the copying of software as a right, leading to a conception of property "configured fundamentally around the right to distribute, not to exclude" (Castells).

Past the property conceptions, what the open source software movement has done is innovate on governance and development process to enable diverse groups of people to collaborate on a common good. It is about the community, not the mere fact that one can copy software for free. It is people working together on a common good. Open Source Software is the foremost example of Architectures of Participation, and indeed each project is different from the next in how it is run; each has its own architecture. But there are now fairly established best practices for how to run a successful software project; Fogel's Producing Open Source Software does an amazing job of gathering them together. I hope that we can build and discover such best practices for architectures of participation around geospatial information.