The last blog post on long tails and the GeoWeb got me thinking about what the implications of different statistical distributions were for the GeoWeb and crowd sourced data in general. Further, is there a right or wrong distribution for data on the GeoWeb, and how do distributions reflect usage and the technical limitations of applications?

While it is unlikely I'll be able to play around with the Google index of KML on the Web anytime soon, I do have the index of the data we had in GeoCommons from our prototype deployment. While not as beefy as Google's, it is still a decent chunk of data, with roughly 33 million unique geometries and 1.6 billion features. It does constitute a small sample of the open source geographic data available globally. To get an idea of the composition of the data, I did a rank order distribution of datasets by their total number of features.

[Figure: GC_data_distribution, a rank order plot of GeoCommons datasets by number of features]

I did the plot on a semi-log scale, which means the vertical axis (number of features) is on a log scale and the horizontal axis (dataset ID) is on a regular linear scale. The log scale means each increment (horizontal line) is a power of ten, so it goes 10, 100, 1000, 10000, etc. This is helpful when you have a really big range of numbers and you would like to see how the data is clustered.

To illustrate the structure of GeoCommons, I added a vertical intersection line to show how many datasets have more than a thousand features. Looking at the intersection point, you can see that roughly 2/3 of the datasets in GeoCommons have more than a thousand features, and the bulk of the datasets have between 100 and 100,000 features.
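For anyone who wants to try the same kind of breakdown on their own data, here is a minimal sketch of the rank ordering and the "more than a thousand features" share. The function names and the sample counts are made up for illustration; they are not the GeoCommons numbers.

```python
def rank_order(feature_counts):
    """Sort dataset sizes from largest to smallest (rank 1 = biggest dataset)."""
    return sorted(feature_counts, reverse=True)


def share_above(feature_counts, threshold):
    """Fraction of datasets with strictly more than `threshold` features."""
    if not feature_counts:
        return 0.0
    above = sum(1 for count in feature_counts if count > threshold)
    return above / len(feature_counts)


# Hypothetical feature counts, one entry per dataset (illustrative only).
sample_counts = [250000, 40000, 12000, 5000, 1500, 800, 90, 12, 1]

ranked = rank_order(sample_counts)
share = share_above(sample_counts, 1000)
print(f"{share:.0%} of the sample datasets have more than 1,000 features")
```

Plotting `ranked` against its rank with a log-scaled vertical axis gives the same style of chart as the one described above.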

Also, the distribution of the data is roughly a 97% fit with an exponential distribution. Interestingly, it does not fit the curve at the extremities, where it looks more like a power law distribution, meaning there are more very small datasets and more very large datasets than an exponential distribution would predict. Further, it is a much different distribution than you would see for all other KML on the Web, where 95% consists of a single feature (place mark).
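One way to check a fit like this is to exploit the semi-log plot: an exponential rank-size relationship shows up as a straight line once the feature counts are logged. The sketch below fits that line with ordinary least squares and reports R², which is only my assumption about what a "percent fit" might mean; the post's figure may have been computed differently. The function name and the synthetic data are illustrative.

```python
import math


def fit_semilog(ranked_counts):
    """Least-squares line through (rank, log10(count)).

    Returns (slope, intercept, r_squared). Needs at least two datasets
    with differing sizes, and every count must be positive.
    """
    xs = list(range(1, len(ranked_counts) + 1))
    ys = [math.log10(count) for count in ranked_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Synthetic, perfectly exponential sizes: count = 10 ** (6 - 0.5 * rank).
synthetic = [10 ** (6 - 0.5 * rank) for rank in range(1, 11)]
slope, intercept, r_squared = fit_semilog(synthetic)
```

On real data, the residuals at the lowest and highest ranks are the place to look for the power-law-like departures at the extremities.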

So what does this all mean? I believe it is largely a reflection of technology adoption and functionality. The first KML creation tools were for generating individual points, lines, and polygons: first in Google Earth, then Virtual Earth, followed by Google MyMaps. MyMaps alone has created over 9 million KML files/maps. In all three applications each feature is created by hand, which puts a practical limit on the number of features any one user will create. While all three have been wildly popular for their simplicity and ease of use, they have resulted in the bulk of KML files being small - IMHO.

Enter a technology advance, like Google Spreadsheets allowing KML generation from tabular data, and you have the ability to mass produce KML. I would wager that the average number of features generated by Google Spreadsheets KML is significantly higher than for the three map drawing tools listed above. How many people are going to create a spreadsheet with just one feature? The point is that as technology empowers users with new functionality, it will impact the structure of the content they produce.

With GeoCommons the door is open to translate shapefiles and multi-attributed spreadsheet data into KML for the Web. The increased functionality again shifts the curve, moving the trend towards larger datasets. LibKML may push it further, and Virtual Earth's integration of Silverlight even further (rendering 100,000 points opens the door to some beefy datasets). This does not mean the smaller datasets are unimportant. In fact, you could make the argument that the growth in mobile applications is going to create even more single-feature datasets.

Building both ends of the curve is critical, and the most interesting applications will combine data from both ends of the spectrum. The intersection of large structured data with personalized, location-aware place marks opens many intriguing opportunities. At the boundary of order and chaos you get complexity, and that is typically where all the interesting phenomena occur.
