MIT’s GeoWeb Repository of Data

March 16th, 2008 by Sean Gorman

We came across a small blurb in the MIT news today about the release of “MIT GeoWeb”:

“… a new interface to the MIT Geodata Repository, enables users to access Geographic Information Systems (GIS) data, once accessible only in ArcGIS, through a standard web browser.”

The MIT GeoWeb provides a Google Maps interface to their extensive repository of geodata in shapefile format. In short, you can search the MIT repository of data by geographic region or keyword, or simply browse it, then visualize the file that you find on Google Maps in the same browser. If you like what you find you can check out the metadata and/or download the shapefile. While the user interface is not the prettiest thing I’ve ever seen, it looks to be effective and has a nice array of data you can browse. The quick visualization of lines, points and polygons is also a very nice feature.

On the downside you can’t click on the data rendered in Google Maps to see the information behind it. You also can’t download the data in a file format other than shapefile, so accessing the data is still restricted to GIS applications. The biggest kicker, though, is that to access the application at all you have to be an MIT student or employee. That puts a bit of a damper on the whole thing, but it is still a clever implementation further pushing the frontier of open data access.

There is a nice screencast of the application here.

Myths of Crowdsourcing

November 3rd, 2007 by Sean Gorman

Figured I would keep the crowdsourced data theme going with some myths I’ve seen crop up in many people’s perception of crowdsourced data and its reliability. First let’s take a step back and look at a definition of crowdsourcing, “[the] act of taking a job traditionally performed by an employee or contractor, and outsourcing it to an undefined, generally large group of people, in the form of an open call.” The fact that this “group” is not paid or under contract leads many to believe what they produce cannot be trusted. I think this general assumption leads to a number of myths about crowdsourced data.

Crowdsourced Data and Official Data are Mutually Exclusive

There is a common perception (especially from traditional data providers) that data either comes from an official source and is guaranteed accurate, or it is crowdsourced and you have no clue if it is accurate or not. Encyclopedia Britannica articles come from an official source and Wikipedia is crowdsourced. NAVTEQ street data comes from an official source and OpenStreetMap is crowdsourced. We can trust Encyclopedia Britannica and NAVTEQ because we pay them to provide us an accurate product, but we are not sure if we can trust Wikipedia and OpenStreetMap because we do not have a contract with them and any willy-nilly crazy person could enter bad data. The issue is seen in black and white - non-trusted and trusted.

In reality crowdsourcing is a tool to collect data. Sometimes it is an end in and of itself, like Wikipedia and OpenStreetMap. Other times it is an enabler - like voting on news stories from third party sources on Digg. Digg’s stories are not user generated, but it crowdsources the determination of which stories are most worth reading. More recently TomTom has used crowdsourced data to enhance their official base data. Perhaps the greatest potential of the crowdsourcing model is a hybrid working with traditional/official data sets. Not only mixing the two together, but using crowdsourcing to enhance the accuracy and validity of existing official data. For instance, a map of toxic dumping sites from the EPA is interesting by itself, but it is eminently more valuable if you can add your own data on the schools, playgrounds, and friends’ houses your kids play at. Secondly, if you would like to add evidence to the map supporting the damage caused by the dumping site, or add evidence showing the dumping site has been cleaned up, then everyone has better context for the original data set and its validity. In both cases crowdsourcing is being used to enhance existing data and does not stand by itself.

Official Data is Automatically Accurate and Crowdsourced Data is Not

There is a pervasive myth that if data comes from an official source or has official metadata then it must be accurate. Conversely, if it is crowdsourced it must be inaccurate. The truth of the matter is that official data and metadata have inaccuracies and crowdsourced data has inaccuracies. In fact the vast majority of data in the world has inaccuracies. To quote Chris (our beloved Heretic Alpaca and CTO), “your data sucks and my data sucks - now that we have that settled we can go do something.” The fact that people think crowdsourced data is inaccurate is truly a good thing, because they think about what they are consuming and are looking to see if there are problems. The beauty is that when they find problems they can actually go and fix them. The worst thing about official data is that we blindly assume that everything is perfect, and when we find that perfection lacking there is no recourse to fix it.

Metadata is the Panacea

Many a GIS wonk has preached that without metadata geographic information is just content. Once there is metadata the professionals have entered the room and all concerns evaporate. When people ask me about metadata in GeoCommons, especially our government customers, I say sure, we can include your metadata. We can even make it mandatory to include metadata before inclusion if that is your preference, but we do not think just having metadata is sufficient. Metadata can often be anonymous, and there are seldom repercussions or rewards whether you are sloppy and quick putting in your metadata or thorough and diligent. When you fuse metadata with a crowdsourcing approach there can now be accountability. I create and contribute the data and that data is attached to me. You can click on the source and you get my profile. If the data rocks - kudos and praise for me, if the data blows - everyone knows I was the slacker who put it in.

Recently I did some digging back into the arguments around FGDC metadata when it came out in the early nineties. The standard was not without criticism and suggestions for improvement (Dutton 1994), “The metadata standard is per force formulated from a producer’s perspective. It is, one assumes, the responsibility of data producers to document published datasets, and there is not much consumers can do other than to offer feedback on the adequacy of the organization, usability and quality of datasets they acquire.” We now have the technological means by which to address what could not be addressed then, yet we are too ensconced in the status quo and dogma to embrace the opportunity to improve the system.

Crowdsourcing is the Wild West of Data

Crowdsourcing is often conflated with “no rules” or “anything goes”, thus leading to a perception of not being trustworthy. While you can crowdsource with no rules, it does not mean you are not allowed to have rules. Further, those rules can result in highly trusted content. Think of academic publishing, one of the most successful crowdsourced experiments of all time. Anyone can submit an article, have it reviewed anonymously by a group of peers, and have it published in some of the most trusted publications on the planet. No one pays me to publish an article. I have no economic incentive for the data in my paper to be accurate, yet I would trust information from “The New England Journal of Medicine” way before I would anything out of the Encyclopedia Britannica. But…you say…academic journals are written by professionals! Not necessarily true - anyone can submit to an academic journal. You need no pedigree, and articles have been published by undergraduates that have no degree at all. The same kid with a Facebook page and 254 friends. Academic journals are trusted because of the peer driven culture that surrounds them, not economic incentive or accuracy standards that must be adhered to. A crowdsourced system can be highly trustworthy depending on how it is structured and the rules that are put in place. I do believe there is a trade off between the number of rules and requirements and the level of participation and innovation in a crowdsourced system. The more rules and requirements, the higher the level of trust, but the less participation and possible innovation. Those that can maximize trust and participation in a crowdsourced application will be those who succeed.

Conclusion

In short I think crowdsourced data and tools often get an undeserved stereotype. People tend to lump it all together instead of looking at opportunities to leverage a new tool to enhance their competitiveness. I think this is often the result of fearful knee jerk reactions. Crowdsourcing does have the ability to disintermediate market places, but those who figure out how to harness that to their advantage will be the ones who succeed. Defensive criticism is usually a sign you are strategically headed in the wrong direction.

Popularity: 8% [?]

Andrew Turner has a great series of blog posts on the future of KML that were the product of meetings at the OGC on the topic a week or so ago. There is lots of interesting content in Andrew’s series, but the one most near and dear to us is the discussion on metadata. Chris made it out to the meeting with Andrew to throw our 2 cents into the discussion, and to convey his thoughts on the schema tag and how attribute data can be embedded into it. We should not confuse adding attribute data to KML with adding metadata to KML, as Sean Gillies points out in response to Andrew’s post. Both are important but serve two different and distinct functions.

Our use of the schema tag is to allow additional data to be added to KML to describe a location on the map. Natively KML supports the ability to add a description and Z coordinate to a location. So, you can describe a push pin with text, HTML and/or a picture, then add a Z coordinate that provides a metric to that push pin. This allows you to do many things and has created a lot of great KML, but there are limits. Namely, you can only really add two attributes - a description and a metric. Lots of location descriptions, and data in general, are multidimensional.

Let’s take a simple example: one of the first Google “My Maps” mashups, of the 2004 US Presidential Election. The election mashup is a nice thematic map of Bush (red states) versus Kerry (blue states) votes, and when you click on a state it shows you the percent of votes for each candidate. The data on the percentage of votes for Bush and Kerry is placed in the description field of the KML, requiring the user to color code each state by hand to create the thematic map. This is quite a bit of work since you are using a qualitative data field to try and do something quantitative.
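To make the point concrete, here is a minimal Python sketch of what becomes possible once the vote share is a real number rather than text buried in a description. The function name and the sample values are hypothetical placeholders, not actual election results; the only KML fact relied on is that color values are written as hex in aabbggrr order.

```python
def vote_share_to_kml_color(bush_pct: float, alpha: int = 200) -> str:
    """Map a vote percentage (0-100) to a KML color string.

    KML colors are hex in aabbggrr order (alpha, blue, green, red),
    so a fully "red" state comes out as ...0000ff.
    """
    share = max(0.0, min(100.0, bush_pct)) / 100.0  # clamp, then scale to 0..1
    red = round(255 * share)
    blue = round(255 * (1.0 - share))
    green = 0
    return f"{alpha:02x}{blue:02x}{green:02x}{red:02x}"


# Placeholder values for illustration only, not real results.
sample_shares = {"State A": 100.0, "State B": 0.0}
sample_colors = {
    state: vote_share_to_kml_color(pct) for state, pct in sample_shares.items()
}
```

With a quantitative field available, the color of every state falls out of one function call instead of being assigned by hand.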

This is something we would like to change, by making it a lot easier for anyone to create KML that easily handles quantitative data. The geoweb, to date, has done a great job of opening up mapping by allowing anyone to create a qualitative description (text, HTML, pictures) of a location. This is what KML is currently geared to support, but there are an increasing number of people that would like to expand quantitative data beyond a single Z attribute.
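As a rough illustration of carrying several typed attributes per location, the sketch below builds a small KML document with Python's standard library. It assumes the Schema / ExtendedData / SchemaData elements as they appear in KML 2.2, which may differ from the schema tag usage discussed at the meetings; the function name, schema id, field names, and values are all hypothetical.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(schema_id, fields, places):
    """Return KML text where each placemark carries typed attribute data.

    fields: dict of attribute name -> KML type (e.g. "double", "string")
    places: list of dicts with "name", "lonlat" (lon, lat) and "attrs"
    """
    ET.register_namespace("", KML_NS)  # serialize with a default namespace

    def q(tag):
        return f"{{{KML_NS}}}{tag}"

    kml = ET.Element(q("kml"))
    doc = ET.SubElement(kml, q("Document"))

    # Declare the attribute fields once, at document level.
    schema = ET.SubElement(doc, q("Schema"), {"name": schema_id, "id": schema_id})
    for name, kml_type in fields.items():
        ET.SubElement(schema, q("SimpleField"), {"name": name, "type": kml_type})

    for place in places:
        pm = ET.SubElement(doc, q("Placemark"))
        ET.SubElement(pm, q("name")).text = place["name"]
        ext = ET.SubElement(pm, q("ExtendedData"))
        data = ET.SubElement(ext, q("SchemaData"), {"schemaUrl": f"#{schema_id}"})
        for field in fields:
            cell = ET.SubElement(data, q("SimpleData"), {"name": field})
            cell.text = str(place["attrs"][field])
        point = ET.SubElement(pm, q("Point"))
        lon, lat = place["lonlat"]
        ET.SubElement(point, q("coordinates")).text = f"{lon},{lat}"

    return ET.tostring(kml, encoding="unicode")


# Hypothetical field names and placeholder values, for illustration only.
example_kml = build_kml(
    "ElectionResult",
    {"pct_candidate_a": "double", "pct_candidate_b": "double"},
    [{"name": "State A", "lonlat": (-100.0, 40.0),
      "attrs": {"pct_candidate_a": 55.0, "pct_candidate_b": 45.0}}],
)
```

The description field stays free for qualitative text, while any number of quantitative fields travel alongside it.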

In his post Andrew pointed to our use of the schema tag to enable thematic mapping, and that is accurate, but it is only the tip of the iceberg of what is possible. Once you have access to multiple data descriptors about a location, it enables a range of decision making tools. KML currently reflects the “read-write” functionality of Web 2.0, but in order to evolve to a “read-write-execute” web it will need the ability to support quantitative functions that give users decision support.

Since things are always clearer with examples, and our favorite example is finding bars and single men/women, let me give it a shot. Currently we would search for bars and get back KML that describes the bar - name, address, user comments, maybe a user rating. The KML and current applications cover this very well - we can “read” and “write” back to the KML - very Web 2.0. What is missing is any analysis of those bars that tells me the best one to go to.

Let’s say the application already knows a few things about me - I am 33 years old, single, male, work in IT, and I am a Taurus. This information and much more could be easily picked up from a social network profile like Facebook or MySpace. If I now did a search on bars, and the KML had embedded feature attribute data for the bars and the surrounding contextual data, I could be directed to the bars that had the highest correlation with women that are single, in an adjacent age bracket, and work in IT. If I had a good experience at the bar I could post back my comment to the bar, further reinforcing that quantitative correlation with user generated validation. Now my KML has enabled a “read-write-execute” application that is both qualitative and quantitative. That, I believe, is the long term value proposition for KML 3.0.
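The “execute” step could be as simple as scoring each location against a set of preferences. The Python sketch below assumes the attribute values have already been read out of the KML into plain dictionaries; the attribute names, weights, and numbers are hypothetical placeholders, and a weighted sum stands in for whatever correlation measure a real application would use.

```python
def rank_places(places, weights):
    """Sort places by a weighted sum of their numeric attributes, best first.

    places:  list of dicts with "name" and "attrs" (attribute name -> number)
    weights: attribute name -> importance; missing attributes count as 0
    """
    def score(attrs):
        return sum(weight * attrs.get(name, 0.0) for name, weight in weights.items())

    return sorted(places, key=lambda place: score(place["attrs"]), reverse=True)


# Placeholder attribute data, for illustration only.
bars = [
    {"name": "Bar A", "attrs": {"share_single": 0.4, "share_age_bracket": 0.5, "share_it": 0.1}},
    {"name": "Bar B", "attrs": {"share_single": 0.6, "share_age_bracket": 0.6, "share_it": 0.3}},
]
preferences = {"share_single": 1.0, "share_age_bracket": 1.0, "share_it": 1.0}

ranked = rank_places(bars, preferences)
```

User comments posted back to a location could then nudge the weights or the attribute values, which is the “write” half feeding the “execute” half.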

Popularity: 17% [?]

OK – the title is a bit over the top and sensationalistic, but the metadata debate opens up the larger topic of technology being used to increase participation. There is a long history of technology increasing participation – the PC Revolution with the microcomputer, word processor, spreadsheet, etc. – Web 1.0 with online auctions, web home pages, online communities, etc. – Web 2.0 with blogs, social networks, citizen journalism, etc. If you really wanted to push the argument you could go back to the assembly line, the steam engine, or really stretch it back to crop rotation. I’d argue that the real power of Web 2.0 has been the democratization of participation through technology. Blogs are allowing anyone to have a voice – participatory media sites like Digg, Newsvine, StumbleUpon, Furl are allowing the public to vote on what is news – self broadcasting platforms like YouTube, Vimeo, Blip.tv will put anyone on TV – participatory office applications like Writely and Google Spreadsheets let anyone collaborate on documents. All of these are changing the face of how the public interacts with technology and each other.

Mapping has very much been a part of this story, with Google Earth/Maps, Microsoft Virtual Earth, Yahoo! Maps and new projects like OpenStreetMap all playing a role. In fact it was mapping applications that kicked off the mashup phenomenon, with the combination of Google Maps and Craigslist rental listings. Not surprisingly, participatory mapping mashups sprang up in short order with innovative sites like Platial, Tagzania, Frappr and others. In these applications anyone could create a location on a map and tag it with social information like photos or descriptions about why they created it. These efforts were very much in the Web 2.0 model of mass participation where anyone could contribute information. For the most part, though, the data was fun and not what the GIS world would consider substantive. Sometimes this movement is called neogeography, web mapping, or a leading part of the larger geoweb.

In the GIS world it is a much different model, where a small number of highly trained professionals have access to data and tools with which they render maps to be distributed to everyone else. As technology has advanced, these maps started to be delivered to web browsers and to have some interactivity. The model always remained the same though – professional gate keepers that brokered knowledge out to the masses. As the APIs of Google and other mapping applications have proliferated, the worlds of neogeography and traditional GIS have begun to intersect. Now the major GIS vendors are offering APIs to their technologies and there are new, more dynamic ways for maps and information to be delivered. While the new technologies coming from the GIS vendors all have the right buzz words, they still work on the very same model. A small group of trained professionals acting as gate keepers to the masses – whether their maps are delivered to you as a piece of paper or a rich media Ajax application.

This is the crux, I believe, of the metadata debate. Let’s be honest, adding a metadata link to a system like ours or anyone else’s is not really the issue. Adding in the link is not so tough and we’ll figure out an effective way to link to metadata if it is there. The issue is opening up geographic data and analysis tools to the masses. Metadata is a convenient barrier to entry, as is the expense of software, training, and infrastructure to even get your foot in the proverbial geospatial door. The big goal of GeoCommons is to break down those barriers, so that geographic data and analysis can become accessible and participatory to everyone. I think that technology inexorably moves in this direction, but in my mind that is not why it is crucial to open up geographic data and analysis. The vast majority of geographic data is a public good. It is paid for and created by governments and nongovernmental organizations (NGOs). The mission of the data creators is to have the data readily available and consumable by the public, because they are inherently the ones that have paid for it. Yet we have a huge middleman that has grown up between the public and the data. A middleman that requires you to buy software, take training classes to use it, and support their ecosystem in order to access and consume the data. This ecosystem has in turn created a profession of people who have taken the courses, put in the time, and understand the often complicated world of geographic data and analysis. Neither the ecosystem nor the profession wants to see that cozy arrangement disrupted. Yet that is exactly what we are on the brink of.

Don’t get me wrong, I am not advocating the end of Geographic Information Science or Systems. There is sophistication in the discipline that will never be comprehensible to the masses and that will always be the case. I spent way too long in grad school trying to sort it all out to have delusions that my Mom is going to be computing Voronoi tessellations. There are great things that the GIS world has contributed and will continue to contribute, but it should not be an all or nothing monopoly. I do believe that access to geographic data and simple analysis tools should be made available to everyone, and I should not have to jump through ridiculous barriers to entry to consume the data my tax dollars have already paid for. That all said, there is an incredible amount of work that needs to be done to make this happen. We may or may not figure it all out, but we’ll push the ball forward, and I’d put all the money in my piggy bank on the model changing through one innovation or another.

Popularity: 8% [?]

A few weeks ago James Fee wrote a blog post about a debate he had with Steven Citron-Pousty concerning the usefulness of GeoCommons. From a high level the argument came down to the GIS vs. Neogeography debate. There were great quotes on both sides, like “freaking sweet” in support and “pretty worthless” in the bashing category. Overall we were excited to see that GeoCommons had started a debate in the GIS community. The intent of GeoCommons had originally not been to provide a resource to the GIS community, but to provide access to GIS data and a few tools to the rest of the world. The GIS community always had access to the data and the tools, so I had figured GeoCommons would not even pop up on the radar. In hindsight I think we should be providing mutual resources to each other, so with that in mind here are a few thoughts on the topic.
