We made the cross-country hop out to Santa Clara to attend Location Intelligence today. The weather is awesome and we just finished the morning workshops. I sat in on Lior Ron and David Minogue’s talk on “Searching the GeoWeb”.

The talk produced many interesting insights on Google’s approach to searching geodata, but one statistic really grabbed my attention. Of the millions of KML files Google has indexed, roughly 95% have only a single feature. That means the vast majority of KML indexed by Google consists of single place marks like “this is my house” or “this is an airplane in flight”.

There are also many single place marks that carry more useful data, and Lior did a great job presenting several workflows that pulled up very relevant place marks for things like finding a place to windsurf in the Bay Area or a place to hike in Austria.

What I found fascinating was Google’s focus on the long tail of data, which has been a popular meme in Web 2.0 in general. The long tail refers to the tail end of a statistical distribution, which covers a large number of entities that each have a small number of observations.

[Figure: conceptual long tail]

You can also think of this as the 80/20 rule, where 20% of the people have 80% of the wealth and the other 80% of the people have only 20% of the wealth. In this situation the long tail of the distribution is the 80% of the people with 20% of the wealth - a large number of people who each have only a small number of observations (wealth).

This is also called the Pareto principle. It often manifests itself as a power law distribution, and power laws are commonly referenced in the study of complex systems (http://en.wikipedia.org/wiki/Complex_system) to describe self organizing systems and networks.

Google’s indexing of KML on the GeoWeb is fundamentally a self organizing system of user generated content, and not surprisingly it looks to fit a power law distribution. Specifically, it looks like a power law distribution where 95% of the KML files have a single feature and the other 5% have a very large number of features, accounting for a disproportionate share of the total features in the database. Without the raw data it is just a hunch on my part, but I would bet a bar tab on the R-squared of a power law fit being above 0.85 on a rank order distribution of KML by file size or number of features.
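For anyone who wants to test that bar bet against real numbers, here is a minimal sketch of the check I have in mind. It ranks files by feature count, fits a straight line to log(count) versus log(rank) with ordinary least squares, and reports the slope and R-squared. The function name and the sample counts are made up for illustration (I do not have Google's data), and a log-log regression is only a rough first look at whether something is power law shaped, not a rigorous test.

```python
import math

def power_law_fit(counts):
    """Least-squares fit of log(count) = a + b * log(rank).

    Returns (slope, r_squared). Needs at least two positive counts
    that are not all identical.
    """
    # Rank order: largest count gets rank 1.
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    # For a one-variable linear fit, R-squared is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared

# Hypothetical feature counts per KML file, NOT real data:
# generated from a power law with exponent -1.5, then rounded.
feature_counts = [round(100000 / rank ** 1.5) for rank in range(1, 201)]

slope, r2 = power_law_fit(feature_counts)
print(f"slope = {slope:.2f}, R-squared = {r2:.3f}")
```

With a real index you would swap `feature_counts` for the number of features (or the file size) of each KML file and see whether R-squared clears 0.85.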

So, geeking out on statistics and complexity theory aside, why does this matter? It matters because I believe a focus on the long tail ignores the power of the short tail. The long tail is easy to deal with from a computational perspective - the file sizes are small and rendering small numbers of place marks is easy. This keeps everything very manageable and scalable.

The downside is that it leaves out many of the most interesting datasets potentially available, because they are large and complex - sitting on the short tail. Another popular Web 2.0 meme is that “data is the Intel inside” - positing that large complex data sets are one of the key differentiators on the Web. So, it would seem in this case that the focus on the “long tail” and positioning “data as the Intel inside” are in conflict. This also may be another indicator of where the semantic web (or whatever you want to call the next evolution of things) diverges distinctly from Web 2.0. Until the GeoWeb can solve the problem of dealing with large complex datasets, I think it will be difficult to answer the deeper questions that create substantive value for users.

Talking with Lior and Dave after the workshop, we agreed it was a tough problem, but one with big potential if solved well. Dave also brought up the thorny issue of how you know you are answering questions correctly. That is another can of worms that will have to wait for another blog post, but it will be hugely important as things evolve.

As a side note, apologies to everyone for the issues we’ve been having with the date on the blog. Our virtual machine decided it wanted to peer into the future and run its system clock faster than reality. Looks like we have it fixed, but it blew away this blog post and several recent comments. I’ve done my best at rewriting this one, but sadly it looks like we’ve lost the comments. Fortunately most of them were letting us know the date had done gone crazy. On the upside, if anyone wants to know what the weather is going to be like this weekend or how the primaries will turn out, just let me know ;-)


Links List 4.28.08

April 28th, 2008 by Sean Gorman

All Points Blog shares an article from Federal Times that looks at how government agencies are using Google and Microsoft for mapping applications.

Crowd-sourced data and seismology are discussed on Geomantic.

Privacy and GIS data are reviewed by GISLounge, which displays the public concern over privacy in imagery and information.

Mapperz announces that Yahoo Local is including GeoSpatial search functions now, providing search results that can be interactively expanded/refined by geographical location.


Virtual Earth vs. Google MyMaps KML Support

April 26th, 2008 by Sean Gorman

As we’ve been putting GeoCommons through its paces I’ve been testing KML files we generate in different applications. The most interesting comparison by far has been between Virtual Earth and Google MyMaps. I did a high-level comparison of the two plus Yahoo! MapMixer a few blog posts back, but after testing several KML files in each I thought it would make for a good follow-up, especially after Michael Jones’ comments to James Fee’s post about KML being the HTML of the GeoWeb.

The good news is that both Virtual Earth and Google Maps support KML, and we are seeing a greater number of applications supporting it and GeoRSS as GeoWeb standards. As the standards get picked up it will be interesting to see how they are supported and how applications differentiate themselves in doing so. Already we can see this beginning between the two titans (Microsoft and Google) expressing how their support of KML has advantages over the other. So, I thought I’d share what our experience was testing with both applications.

Google KML Support

For testing purposes I started off with a polygon data set of the 100 most polluted counties in the United States. The upload process for Google MyMaps was straightforward, and my uploaded KML (or GeoRSS) file prepopulated a title and description field. Then, after a bit of chugging, it rendered the KML file on the map. You can see the map I created embedded below:


View Larger Map

If you look closely you’ll notice that there are not 100 counties on the map (only about 44). Google MyMaps will support 200 pushpins on a map, but when you add in complex polygons the number of polygons and associated pushpins it will support goes down significantly. The MyMaps application gets around this problem by paginating the KML file into multiple maps, each supporting the maximum number of pushpins, lines or polygons. Unfortunately you can only embed one map page at a time, so the map above only shows the first set of polygons.

An interesting observation in the Microsoft blog post about KML support noted that, “on Google Maps the polygons representing the parks didn’t load at all”. Our KML rendered the polygons fine, but we took an extra step in GeoCommons to generate our polygons as multigeometries, where a pushpin with the data is included inside the polygon and highlights when you mouse over (at least in Google Earth). So, my hunch is that in order to get polygon KML to render in Google MyMaps you need to structure it as a multigeometry, or they’ve added the functionality since then. It would be great not to have to add the pushpin to get the data, and to have clickable polygons in both Google Earth and Google Maps.
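To make the multigeometry idea concrete, here is a rough sketch of the kind of placemark I am describing: a KML `MultiGeometry` that holds both a `Point` (the pushpin) and a `Polygon`. This is not the actual GeoCommons generator. The function name, the county name and the square of coordinates are all made up for illustration, and I am assuming the OGC KML 2.2 namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: OGC KML 2.2 namespace.
KML_NS = "http://www.opengis.net/kml/2.2"

def polygon_placemark(name, ring, pin):
    """Build a Placemark whose MultiGeometry holds a Point and a Polygon.

    ring: list of (lon, lat) tuples, closed (first point == last point).
    pin:  (lon, lat) for the pushpin placed inside the polygon.
    """
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = name

    multi = ET.SubElement(placemark, "MultiGeometry")

    # The pushpin that carries the data balloon.
    point = ET.SubElement(multi, "Point")
    ET.SubElement(point, "coordinates").text = f"{pin[0]},{pin[1]}"

    # The polygon outline itself.
    polygon = ET.SubElement(multi, "Polygon")
    outer = ET.SubElement(polygon, "outerBoundaryIs")
    linear_ring = ET.SubElement(outer, "LinearRing")
    ET.SubElement(linear_ring, "coordinates").text = " ".join(
        f"{lon},{lat}" for lon, lat in ring
    )
    return placemark

# Hypothetical county: a made-up square, not real boundary data.
square = [(-95.0, 29.0), (-94.0, 29.0), (-94.0, 30.0), (-95.0, 30.0), (-95.0, 29.0)]

kml = ET.Element("kml", xmlns=KML_NS)
document = ET.SubElement(kml, "Document")
document.append(polygon_placemark("Example County", square, (-94.5, 29.5)))

kml_text = ET.tostring(kml, encoding="unicode")
print(kml_text)
```

Whether a given viewer makes the polygon itself clickable, or only the pushpin, is up to that viewer, which is exactly the difference between applications I ran into in testing.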

On the plus side, Google MyMaps does a good job handling multi-polygons. A multi-polygon is when you have multiple polygons representing one geographic entity. For instance, the United States of America consists of several separate polygons, including Alaska, the Hawaiian Islands, and the contiguous states. Several of the counties in our test data set had multi-polygons and you can see those rendered in detail in the embedded map below:


View Larger Map

A second plus for Google MyMaps is balloon support for the data that shows all the attributes in a nicely parsed list. Even when I loaded up a census data set with 74 attributes it listed them all out with a scroll bar. So to recap:

Advantages = prepopulated title and description, quick load, multi-polygon support, full listing of data attributes.

Disadvantages = limited number of polygons rendered on one map, requires multigeometry KML to support clickable polygons, slow rendering of polygons, no ability to export KML or other standard.

Virtual Earth KML Support

Virtual Earth KML support is provided through the “Collections” feature. When you click “Import Collection” you are given the option to add a KML file (or GeoRSS or GPX). I uploaded the same county pollution file and Virtual Earth chugged along for a bit then gave me a message saying, “100 out 100 items uploaded”. I’ve tried this with other files, and if a file has more than 200 features it will not upload all of them - just the first 200, then it stops. Also, if your KML file is over 2 MB it will tell you it is too large. Overall this is a nice feature that lets you know the bounds of the system and what will work and what will not.
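If you want to know ahead of time whether a file will run into those bounds, a quick check like the sketch below does the job. The 200 feature and 2 MB figures are just what I observed in testing, not documented limits. The function name is made up, I am treating 2 MB as 2 * 1024 * 1024 bytes (an assumption), and I am counting each `Placemark` as one feature.

```python
import xml.etree.ElementTree as ET

# Limits observed in testing, not official numbers.
MAX_FEATURES = 200
MAX_BYTES = 2 * 1024 * 1024  # assumption: "2 MB" means 2 MiB

def check_kml_limits(kml_bytes):
    """Count Placemarks in a KML document and compare against the limits."""
    root = ET.fromstring(kml_bytes)
    # Strip any "{namespace}" prefix so this works across KML versions.
    features = sum(
        1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "Placemark"
    )
    return {
        "features": features,
        "too_many_features": features > MAX_FEATURES,
        "too_large": len(kml_bytes) > MAX_BYTES,
    }

# Made-up sample: 250 empty placemarks, small on disk but over the feature limit.
sample = (
    "<kml xmlns='http://www.opengis.net/kml/2.2'><Document>"
    + "<Placemark><name>p</name></Placemark>" * 250
    + "</Document></kml>"
).encode("utf-8")

report = check_kml_limits(sample)
print(report)
```

For a file on disk you would read it in binary mode and pass the bytes straight in.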

The second nice part is that all 100 counties made it on one map instead of just 44 as with Google. Another bonus was that Virtual Earth did not need the multigeometries to support the clickable polygons rendered on the map. In fact, the multigeometries we included in our KML generation caused both a pushpin to be drawn and a second square that gets highlighted when you mouse over the polygon. You can check out the map here and see the screen shot below:

[Screen shot: MSVE_polygons]

Sadly Virtual Earth does not support embeds, so just the screen shot and link. Another small ding, as you can see in the screen shot, is that Virtual Earth does not support multi-polygons. The spots where you see push pins instead of polygons are counties, like Galveston, that are represented by multiple polygons and could not be rendered, so a push pin was placed there instead. It still gets the job done, but there is something dissatisfying about America’s or any other political unit’s borders being replaced by a push pin. The last complaint is that Virtual Earth only supports a limited number of characters for attributes, so when I tested a census file with 74 attributes I only got the first twenty or so and they were not well formatted. So to recap:

Advantages = ability to render more polygons, ability to render polygons faster, ability to support clickable polygons without multigeometries, ability to export KML (and other formats)

Disadvantages = inability to support multi-polygons, slow to load KML, limited support of data attributes, no support of balloon styling

Overall I would give a slight edge to Virtual Earth when it comes to KML support from our unique perspective. Specifically, the ability to load a larger number of polygons on a map and make those easily clickable allows more of our content to be leveraged at this point. It will be interesting to see how Google, Microsoft and others continue to enhance KML support to make more data available. I believe there is still a long way to go, and the vast majority of the datasets in GeoCommons are too large for either to handle at this point. As the GeoWeb and the data it interconnects become more sophisticated, I think it will be a necessity to greatly increase the amount and complexity of data that can be handled in a browser based map. Hopefully the market pushes Microsoft, Google and others to innovate in that direction.
