We made the cross-country hop out to Santa Clara to attend Location Intelligence today. The weather is awesome and we just finished the morning workshops. I sat in on Lior Ron and David Minogue’s talk on “Searching the GeoWeb”.

The talk produced many interesting insights on Google’s approach to searching geodata, but one statistic really grabbed my attention. Of the millions of KML files Google has indexed, roughly 95% have only a single feature. That means the vast majority of KML indexed by Google consists of single placemarks like “this is my house” or “this is an airplane in flight”.
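To make "a single feature" concrete, here is a minimal sketch of how one might count the features in a KML document. It assumes Placemark elements are a reasonable stand-in for "features" (KML has other feature types too), and both the `count_placemarks` helper and the sample document are made up for illustration. It says nothing about how Google actually counts.

```python
import xml.etree.ElementTree as ET


def count_placemarks(kml_text):
    """Count Placemark elements in a KML string, ignoring the XML namespace."""
    root = ET.fromstring(kml_text)
    # Tags look like "{namespace}Placemark"; strip the namespace part if present.
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "Placemark")


# Hypothetical single-feature file of the "this is my house" variety.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>This is my house</name>
    <Point><coordinates>-121.95,37.35,0</coordinates></Point>
  </Placemark>
</kml>"""

print(count_placemarks(SAMPLE_KML))
```

Running something like this over a pile of KML files would give the per-file feature counts that the 95% figure is describing.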

Many single placemarks carry more useful data, though, and Lior did a great job presenting several workflows that pulled up very relevant placemarks for things like finding a place to windsurf in the Bay Area or a place to hike in Austria.

What I found fascinating was Google’s focus on the long tail of data, which has been a popular meme in Web 2.0 in general. The long tail refers to the tail end of a statistical distribution, which covers a large number of entities that each have a small number of observations.

[Figure: conceptual long tail]

You can also think of this as the 80/20 rule, where 20% of the people have 80% of the wealth and the other 80% of the people have only 20% of the wealth. In this situation the long tail of the distribution is the 80% of the people with 20% of the wealth - a large number of people who each have only a small number of observations (wealth).
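As a quick sanity check on the 80/20 idea, here is a small sketch that computes what share of a total is held by the top slice of a population. The `top_share` helper and the wealth figures are invented for illustration only.

```python
def top_share(values, top_fraction=0.2):
    """Return the share of the total held by the largest `top_fraction` of entries."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values, reverse=True)
    # Number of entries in the "head"; always keep at least one.
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


# Made-up wealth figures for ten people: two rich, eight not so much.
wealth = [400, 400, 40, 35, 30, 25, 25, 20, 15, 10]

print(top_share(wealth))  # 0.8 -> the top 20% of people hold 80% of the wealth
```

The remaining eight people are the long tail: lots of entities, each with a small amount.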

This is also called the Pareto principle, and it often manifests itself as a power law distribution. Power laws are commonly referenced in complex systems research (http://en.wikipedia.org/wiki/Complex_system) to describe self-organizing systems and networks.

Google’s indexing of KML on the GeoWeb is fundamentally a self-organizing system of user-generated content, and not surprisingly it looks to fit a power law distribution: specifically, one where 95% of the KML has a single feature and the other 5% has a very large number of features, accounting for a disproportionate share of the total features in the database. Without the raw data it is just a hunch on my part, but I would bet a bar tab on the R-squared of a power law fit being above 0.85 on a rank-order distribution of KML by file size or number of features.
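For anyone who does have the raw data, the bar-tab bet could be checked with something like the sketch below. It ranks the files by feature count, fits a straight line to log(count) versus log(rank) with ordinary least squares, and reports the R-squared. The `rank_order_power_law_fit` helper is hypothetical, the counts are synthetic (generated to be roughly Zipf-like, not taken from Google), and a log-log regression is only a rough check rather than a rigorous test for a power law.

```python
import math


def rank_order_power_law_fit(counts):
    """Least-squares fit of log(count) against log(rank).

    Returns (slope, intercept, r_squared) of the line in log-log space.
    """
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two positive counts")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared is undefined when every count is identical.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r_squared


# Synthetic feature counts: the file at rank r has roughly 10000 / r features.
feature_counts = [round(10000 / rank) for rank in range(1, 201)]

slope, intercept, r_squared = rank_order_power_law_fit(feature_counts)
print(f"slope={slope:.2f}, R^2={r_squared:.3f}")
```

On real data the bet would be won if `r_squared` came out above 0.85. One caveat: a corpus where most files have exactly one feature produces a long flat run of ties at the bottom of the ranking, which will pull the fit away from a clean straight line.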

So, geeking out on statistics and complexity theory aside, why does this matter? It matters because I believe the focus on the long tail ignores the power of the short tail. The long tail is easy to deal with from a computational perspective - the file sizes are small and rendering small numbers of placemarks is easy. This keeps everything very manageable and scalable.

The downside is that it leaves out many of the most interesting datasets potentially available, because they are large and complex - sitting on the short tail. Another popular Web 2.0 meme is that “data is the Intel inside” - positing that large, complex datasets are one of the key differentiators on the Web. So it would seem in this case that the focus on the “long tail” and positioning “data as the Intel inside” are in conflict. This may also be another indicator of where the semantic web (or whatever you want to call the next evolution of things) diverges distinctly from Web 2.0. Until the GeoWeb can solve the problem of dealing with large, complex datasets, I think it will be difficult to answer the deeper questions that create substantive value for users.

Talking with Lior and Dave after the workshop, we agreed it was a tough problem, but one with big potential if solved well. Dave also brought up the thorny issue of how you know you are answering questions correctly. That is another can of worms that will have to wait for another blog post, but it will be hugely important as things evolve.

As a side note, apologies to everyone for the issues we’ve been having with the date on the blog. Our virtual machine decided it wanted to peer into the future and run its system clock faster than reality. Looks like we have it fixed, but it blew away this blog post and several recent comments. I’ve done my best at rewriting this one, but sadly it looks like we’ve lost the comments. Fortunately most of them were letting us know the date had done gone crazy. On the upside, if anyone wants to know what the weather is going to be like this weekend or how the primaries will turn out, just let me know ;-)

2 Responses to “Power Law Distributions of Google Indexed KML: Is the Long Tail the Wrong Tail for the GeoWeb”

  1. A Discussion of Statistical Tails and Technology Adoption | Off the Map - Official Blog of FortiusOne Says:

    […] last blog post on long tails and the GeoWeb got me thinking about what the implications of different statistical […]

  2. Top Picks at Where 2.0 and an Emerging Open Data Sharing Theme | Off the Map - Official Blog of FortiusOne Says:

    […] the data theme there is another interesting pairing of talks with Lior Ron of Google and Juan Gonzalez of PlanetEye, both talking about the indexing of geo-content. Lior gave […]
