Links List 5.30.08
May 30th, 2008 by Sean Gorman
Are paper maps no more? GIS Lounge reports that the cartography division of the California State Automobile Association is slowly being phased out. The cause for the demise is the widespread availability of online map directions and in-car navigation units which cut demand for the paper maps by 13% in 2007.
The Geospatial Semantic Web Blog shares some good news for the semantic web community. The U.S. Securities and Exchange Commission recently proposed a timetable requiring 500 of the largest public companies to begin filing their financial data using XBRL (Extensible Business Reporting Language). This will create a massive amount of free, real-world data for research.
Speaking of data, Anand at DataWocky answers the question of why the world needs a new database system. He discusses high volumes of data that are not being utilized due to scalability. He points to the newly launched Aster Data which is a database system natively designed and architected from the ground up for a new hardware platform: commodity clusters.
Google Earth has a new browser plug-in, which continues Google's rollout of the Google Maps API for Flash and Google App Engine. Released with it is the very extensive Google Earth JavaScript API for writing 3D map applications. Moxie thinks that this has opened a new page for GeoWeb visualization.
Power Law Distributions of Google Indexed KML: Is the Long Tail the Wrong Tail for the GeoWeb
April 29th, 2008 by Sean Gorman
We made the cross-country hop out to Santa Clara to attend Location Intelligence today. The weather is awesome and we just finished the morning workshops. I sat in on Lior Ron and David Minogue’s talk on “Searching the GeoWeb”.
The talk produced many interesting insights on Google’s approach to searching geodata, but one statistic really grabbed my attention. Of the millions of KML files Google has indexed, roughly 95% have only a single feature. That means the vast majority of KML indexed by Google consists of single place marks like “this is my house” or “this is an airplane in flight”.
Many single place marks do carry more useful data, and Lior did a great job presenting several workflows that pull up very relevant place marks for things like finding a place to windsurf in the Bay Area or a place to hike in Austria.
What I found fascinating was Google’s focus on the long tail of data, which has been a popular meme in Web 2.0 in general. The long tail refers to the tail end of a statistical distribution that covers a large number of entities, each with a small number of observations.
You can also think of this as the 80/20 rule, where 20% of the people have 80% of the wealth and the other 80% of the people have only 20% of the wealth. In this situation the long tail of the distribution is the 80% of the people with 20% of the wealth - a large number of people, each with only a small number of observations (wealth).
This is also called the Pareto principle, and it often manifests itself as a power law distribution, which is commonly referenced in the study of complex systems (http://en.wikipedia.org/wiki/Complex_system) to describe self-organizing systems and networks.
Google’s indexing of KML on the GeoWeb is fundamentally a self-organizing system of user-generated content, and not surprisingly it looks to fit a power law distribution. Specifically, it looks like a power law distribution where 95% of the KML has a single feature and the other 5% has a very large number of features, accounting for a disproportionate share of the total features in the database. Without the raw data it is just a hunch on my part, but I would bet a bar tab on the R-squared of a power law fit being above 0.85 on a rank order distribution of KML by file size or number of features.
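For anyone who wants to test that hunch on their own data, here is a minimal sketch of the check: sort the counts into rank order, fit a straight line in log-log space, and read off the R-squared. The function name and the synthetic counts are assumptions made for illustration; the counts are not Google's data, and because they are generated from an exact power law the fit is perfect by construction. A least-squares fit on log-log axes is also only a rough first look, not a rigorous test for a power law.

```python
import math

def power_law_fit(values):
    """Fit log(value) = intercept + slope * log(rank) by least squares.

    Returns (slope, intercept, r_squared) for the rank-ordered values.
    """
    # Rank order: largest value gets rank 1. Zero or negative values
    # cannot be log-transformed, so they are dropped.
    ranked = sorted((v for v in values if v > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(v) for v in ranked]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable linear fit, R-squared is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Synthetic stand-in for "features per KML file" (hypothetical values).
synthetic_counts = [10000 / rank ** 1.2 for rank in range(1, 501)]
slope, intercept, r_squared = power_law_fit(synthetic_counts)
print(f"slope={slope:.3f} r_squared={r_squared:.3f}")
```

With real feature counts or file sizes in place of `synthetic_counts`, an R-squared above 0.85 would be consistent with the bar tab bet.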
So geeking out on statistics and complexity theory aside, why does this matter? It matters because I believe it ignores the power of the short tail. The long tail is easy to deal with from a computational perspective - the file sizes are small and rendering small numbers of place marks is easy. This keeps everything very manageable and scalable.
The downside is that it leaves out many of the most interesting datasets potentially available, because they are large and complex - sitting on the short tail. Another popular Web 2.0 meme is that “data is the Intel inside” - positing that large complex data sets are one of the key differentiators on the Web. So, it would seem in this case that the focus on the “long tail” and positioning “data as the Intel inside” are in conflict. This may also be another indicator of where the semantic web (or whatever you want to call the next evolution of things) diverges distinctly from Web 2.0. Until the GeoWeb can solve the problem of dealing with large complex datasets, I think it will be difficult to answer the deeper questions that create substantive value for users.
Talking with Lior and Dave after the workshop we agreed it was a tough problem, but definitely had big potential if solved well. Although Dave brought up the thorny issue of how do you know you are answering questions correctly. That is another can of worms that will have to wait for another blog post, but will be hugely important as things evolve.
As a side note, apologies to everyone for the issues we’ve been having with the date on the blog. Our virtual machine decided it wanted to peer into the future and run its system clock faster than reality. Looks like we have it fixed, but it blew away this blog post and several recent comments. I’ve done my best at rewriting this one, but sadly it looks like we’ve lost the comments. Fortunately most of them were letting us know the date had done gone crazy. On the upside, if anyone wants to know what the weather is going to be like this weekend or how the primaries will turn out, just let me know.
Hierarchy or Folksonomy? Is there a Hybrid between Order and Chaos
April 15th, 2008 by Sean Gorman
When we started the very first iteration of GeoCommons in 2005, folksonomies were all the rage and we jumped on board, using tags to organize the geospatial data that was pushed into the new platform. During the time we had the prototype deployed we ran into many of the same issues other applications have found with folksonomies:
1) people’s tags may be difficult for others to understand,
2) people may have tagged items inappropriately for others’ needs.
In short, your users will not always implement tags in ways that are productive for the community - in the extreme resulting in Flickr’s 20 million unique tags. How many of those 20 million tags are misspelled words, or so far off the path that they never get found?
In addition to the problems you encounter with folksonomies in general, you have the further complications of geospatial data. First, all geospatial data sets have location tags, but adding them in an unstructured way creates enough chaos that it is very difficult to leverage location tags in a thorough way. Secondly, many potential users do not know the variety of geodata available. Put more simply, they do not know what to search for, and having the ability to browse through data by topic is appealing.
Despite the downsides of folksonomies, they are incredibly powerful and have been hugely effective in organizing vast amounts of data on the web. So, as we worked on the next iteration of GeoCommons we started looking at possible hybrid approaches to folksonomies and hierarchies.
Specifically, we looked at the two problems specific to geospatial data listed above: 1) place tags and 2) organizing data for browsing. Solving them required both short-term and long-term solutions.
Fortunately we had a small advantage over many crowdsourced projects in that we have a full-time data team. They are a great group of folks who spend their day finding cool geodata and coming up with clever ways to organize it.
Through the data team and the other community members who contributed data to the first iteration of GeoCommons, we had a big pool of data with a wide variety of tags to examine. What we found were some distinct trends in the tagging and titling of data. Across the data there was a common set of tags that broke the data up into a useful set of distinct categories, but there were also many data sets tagged with elements that often made them undiscoverable. After the analysis we started to look at structures we could establish to help create self-similarity in tagging while keeping the flexibility to be adaptive.
The result was the creation of a location and topical taxonomy based on our existing corpus of data that has the intelligence to adapt as the content grows and evolves. I can’t go into the technical details in depth, but fundamentally the concept is to intelligently leverage the taxonomies and structures to provide suggestions to users to tag their data better.
In many cases this can be very simple - like providing tips on how to tag and title effectively to make your data more valuable to the community. For instance with titles we found across GeoCommons there were four key pieces of information used for datasets in the past.
1) Source name,
2) Original name of dataset from source (or short description of dataset),
3) Geographic area,
4) Time period of data.
Examples:
Communicating this effectively to users is a great way to get better consistency across data contributions, while still allowing flexibility for users to be creative and bring in information that does not fit the rigid mold of a hierarchy. Of course this is the simplest approach, and you can get far more clever.
Del.icio.us, for instance, has a great feature that notifies a user when they are putting in a new tag no one has used before and asks if that is what they meant to do. You can also suggest tags from your taxonomy that are semantically related to the data the user is contributing. This creates a consistency across tags that makes data easier to find as the system scales to larger volumes.
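To make the "new tag" prompt concrete, here is a rough sketch of the idea. It is not how Del.icio.us or GeoCommons implement it: the function name, the tag vocabulary, and the similarity cutoff are all assumptions for illustration, and the suggestions come from plain string similarity (Python's `difflib`), which catches misspellings but is not the semantic matching described above.

```python
import difflib

def check_tag(tag, known_tags, max_suggestions=3, cutoff=0.75):
    """Flag a tag nobody has used before and suggest close existing tags."""
    normalized = tag.strip().lower()
    if normalized in known_tags:
        return {"tag": normalized, "is_new": False, "suggestions": []}
    # String similarity only: catches typos, not related concepts.
    suggestions = difflib.get_close_matches(
        normalized, sorted(known_tags), n=max_suggestions, cutoff=cutoff
    )
    return {"tag": normalized, "is_new": True, "suggestions": suggestions}

# Hypothetical vocabulary of tags already in use.
known = {"census", "population", "transportation", "environment"}

result = check_tag("Transportaion", known)
if result["is_new"]:
    print("New tag:", result["tag"], "- did you mean", result["suggestions"], "?")
```

The user can still keep the new tag, so the folksonomy stays open; the prompt only nudges toward tags that already exist.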
The nice thing about taxonomies as opposed to folksonomies is that they can be structured as trees, which means you can compute across them quite easily. With a solid and adaptive taxonomy in place you can go a long way in intelligently guiding users towards creating better and more consistent tags. At least that is what we think, and it will be fun to see how it works out after the launch.
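As a small illustration of what computing across a tree buys you, the sketch below stores each topic with a pointer to its parent, so a dataset tagged with a narrow topic can also be found when browsing any broader one. The topics and function names are made up for the example and are not the GeoCommons taxonomy.

```python
# Hypothetical topical taxonomy: each topic maps to its parent (None = root).
PARENT = {
    "environment": None,
    "pollution": "environment",
    "toxic releases": "pollution",
    "demographics": None,
    "population": "demographics",
}

def ancestors(topic):
    """Return the chain of broader topics from `topic` up to its root."""
    chain = []
    parent = PARENT[topic]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain

def matches(dataset_topic, browse_topic):
    """True if a dataset tagged `dataset_topic` belongs under `browse_topic`."""
    return dataset_topic == browse_topic or browse_topic in ancestors(dataset_topic)

print(ancestors("toxic releases"))
print(matches("toxic releases", "environment"))
```

Flat tags have no such structure: nothing tells the system that a dataset tagged "toxic releases" should also show up under "environment".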
GeoWeb Metadata Follow Up
April 2nd, 2008 by Sean Gorman
First off, I want to thank the folks who commented on the last post. There was lots of useful feedback, and it also highlighted a bit of confusion I created with the first post. The purpose of the first post was not to propose a new metadata standard. Instead it was simply a proposal for how we could map the metadata we collect in GeoCommons to existing standards.
From that standpoint the proposal is for an implementation, not a standard. We have just about 5,000 unique datasets and about 70,000 data layers, and it would be great to expose useful metadata for the data. The data runs the gamut, from EPA toxic release sites to the number of Facebook users by city. The system and metadata requirements need to be flexible enough to accommodate both a user uploading Facebook data and one uploading EPA data.
While GIS users might not be intimidated by a metadata form with 75 or even 335 elements, your average Web/GeoWeb user definitely will be. The goal with GeoCommons is to provide a destination where both communities can consume and share data, and I think both communities will find tools and data that are useful.
In regard to the metadata elements we proposed to map to in the last post, we were looking for those that both technical and non-technical users would understand, while also automatically trapping as many additional elements as possible. To cover technical users who have a full complement of metadata, the plan is to have an element where you can provide a link to a full metadata specification.
The comments directing us to the ISO 19115 standard were very useful, and we are looking to see what elements we are missing to map to that standard as we evolve. The thing we want to make sure we get right is finding the best set of metadata elements to request from users, balanced against the fact that if we have a huge number of elements, most people are going to go running for the hills.
Right now it looks like we’ll have 17-20 elements that will map to Dublin Core, FGDC, and in a later release ISO 19115. So, for each data set in GeoCommons you’ll have a page that lists those 17-20 elements in the metadata format technical folks are used to seeing. This should also provide a means by which to explore federating the data with other applications and search approaches.
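To show what mapping to an existing standard can look like in practice, here is a minimal sketch that renames a handful of internal fields to Dublin Core elements. The internal field names and the sample record are hypothetical, and this is not the actual list of 17-20 GeoCommons elements; the `dc:` names on the right are elements from the Dublin Core Metadata Element Set.

```python
# Hypothetical internal field names mapped to Dublin Core elements.
DUBLIN_CORE_MAP = {
    "title": "dc:title",
    "description": "dc:description",
    "tags": "dc:subject",
    "source_name": "dc:source",
    "geographic_area": "dc:coverage",
    "contributor": "dc:contributor",
}

def to_dublin_core(record):
    """Keep only mapped, non-empty fields and rename them to Dublin Core."""
    return {
        DUBLIN_CORE_MAP[field]: value
        for field, value in record.items()
        if field in DUBLIN_CORE_MAP and value
    }

# Made-up record for illustration; "layer_count" has no mapping and is dropped.
record = {
    "title": "Example dataset title",
    "description": "",
    "tags": ["example", "sample"],
    "geographic_area": "Example region",
    "layer_count": 3,
}

print(to_dublin_core(record))
```

The same record could be run through a second mapping table for FGDC or ISO 19115, which is the appeal of treating this as an implementation rather than a new standard.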
The goal here is to create a bridge between content being created for the GeoWeb and content created for the GIS world and make both usable and remixable by the web community as a whole. I fully respect the motivations and requirements for the GIS metadata specifications out there, and I hope we can leverage them to create an implementation that will see a high level of adoption.
Without adoption, standards are pretty hollow, as we’ve seen with all the work that went into GML versus the much lighter specifications for KML and GeoRSS. While both have their place, it is clear what the market is supporting. As more geospatial data is created outside of the government, we are not going to have a government mandate to force metadata creation, and what the market accepts is going to become increasingly critical - IMHO. I look forward to getting more feedback as we get ready to launch.