We’ve been doing a lot of data migration and new data uploads with Finder!, and our data team often runs into data and mapping headaches. One that we commonly encounter is largish shapefiles that make for really bloated KML when we convert them (for instance, a 2 MB shapefile of US counties becomes a 5.4 MB KML file). The end result is big files that completely kill browser-based applications like Virtual Earth and Google Maps, or load really slowly in thick client applications like Google Earth and ESRI AGX.

There are three factors that constitute file bloat for any vector based geospatial data:

1) The number of attributes (how many columns)
2) The number of features (how many rows)
3) The complexity of the geometry (how much needs to be drawn)

You can do some clever things to manage the first two at a low level - although you still are going to have bloat when you convert to a standard file format. The third factor, geometry complexity, is interesting because you can also do some low level tricks whose savings can be passed along to standard file formats. Reducing the complexity of geometry is often called “map generalization” in academic circles.

The general concept is that you remove details from the map without losing the message and context of the map. All maps have some form of generalization; otherwise they would be a perfect reflection of reality. Academics have developed algorithms that heuristically derive a generalized map. This is probably best explained with a few examples. Below is a map of Europe in full detail:

europe_mapshaper_detail

Next is a generalization that removes some of the detail but still keeps the context of Europe and the country boundaries:

europe_mapshaper_medium

Last, a more extreme example with even more detail removed:

europe_mapshaper_sparse

Pulling off these nifty computational tricks used to require some fairly sophisticated desktop software, but Matt Bloch and Mark Harrower at the University of Wisconsin figured out a clever way to enable real-time WYSIWYG map generalization. The resulting application is called MapShaper. You can upload a shapefile and run different generalization routines (with a high level of control if you choose), then export the result back out as a shapefile or an EPS file. The shapefile export is down at the moment, but hopefully it will be back in action soon.
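To give a feel for what a generalization routine does under the hood, here is a minimal Python sketch of the Douglas-Peucker algorithm, one classic line simplification method. It is an illustration only and not MapShaper's code; the function names, the sample coastline, and the tolerance values are all made up for the example, and it assumes planar coordinates.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Douglas-Peucker: keep the endpoints, then keep only vertices that
    stray farther than `tolerance` from the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the vertex farthest from the segment joining the endpoints.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= tolerance:
        return [first, last]
    # Otherwise keep that vertex and simplify each half separately.
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right


# Hypothetical coastline fragment; real data would come from a shapefile.
coastline = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
medium = simplify(coastline, 0.5)   # mild generalization
sparse = simplify(coastline, 3.0)   # aggressive generalization
print(len(coastline), len(medium), len(sparse))
```

The tolerance plays the role of the detail slider: the larger it is, the more vertices get dropped and the smaller the resulting file.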

I think these kinds of technologies and mathematics are going to be increasingly important as we need to make ever larger datasets available, especially when the receiving devices are increasingly mobile with even smaller data handling capabilities.

Popularity: 28% [?]

We made the cross-country hop out to Santa Clara to attend Location Intelligence today. The weather is awesome and we just finished the morning workshops. I sat in on Lior Ron and David Minogue’s talk on “Searching the GeoWeb”.

The talk produced many interesting insights on Google’s approach to searching geodata, but one statistic really grabbed my attention. Of the millions of KML files Google has indexed, roughly 95% have only a single feature. That means the vast majority of KML indexed by Google consists of single place marks like “this is my house” or “this is an airplane in flight”.

There are also many single place marks that have more useful data, and Lior did a great job presenting several workflows pulling up very relevant place marks for things like finding a place to windsurf in the Bay Area or a place to hike in Austria.

What I found fascinating was Google’s focus on the long tail of data, which has been a popular meme in Web 2.0 in general. The long tail refers to the tail end of a statistical distribution that covers a large number of entities, each with a small number of observations.

conceptual long tail

You can also think of this as the 80/20 rule, where 20% of the people have 80% of the wealth and the other 80% of the people have only 20% of the wealth. In this situation the long tail of the distribution is the 80% of the people with 20% of the wealth - where there are a large number of people with only small numbers of observations (wealth).

This is also called the Pareto principle, and it often manifests itself as a power law distribution, which is commonly referenced in complex systems theory (http://en.wikipedia.org/wiki/Complex_system) to describe self-organizing systems and networks.

Google’s indexing of KML on the GeoWeb is fundamentally a self-organizing system of user-generated content and, not surprisingly, it looks to fit a power law distribution. Specifically, it looks like a power law distribution where 95% of the KML has a single feature and the other 5% has a very large number of features that account for a disproportionate share of the total features in the database. Without the raw data it is just a hunch on my part, but I would bet a bar tab on the R-squared of a power law fit being above 0.85 on a rank order distribution of KML by file size or number of features.
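For anyone who wants to test that kind of hunch against real numbers, here is a minimal Python sketch of the usual back-of-the-envelope check: rank the files by feature count, take logs of both rank and count, fit a straight line, and read off the R-squared. The feature counts below are synthetic stand-ins (the real index data is not available here) and the function name is just for illustration. A straight line on a log-log plot is suggestive of a power law, not proof of one.

```python
import math


def power_law_fit(counts):
    """Least-squares fit of log(count) against log(rank).

    Assumes every count is positive and that the counts are not all equal.
    Returns (slope, intercept, r_squared) for the rank order distribution.
    """
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(value) for value in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in log(count) explained by the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Synthetic feature counts that follow an exact power law (exponent 1.2).
feature_counts = [10000.0 / rank ** 1.2 for rank in range(1, 501)]
slope, intercept, r_squared = power_law_fit(feature_counts)
print(f"slope={slope:.2f}, R-squared={r_squared:.3f}")
```

With real data the counts would be whole numbers with a long run of ones at the bottom of the ranking, so the fit would be noisier than this idealized case.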

So, geeking out on statistics and complexity theory aside, why does this matter? It matters because I believe it ignores the power of the short tail. The long tail is easy to deal with from a computational perspective - the file sizes are small and rendering small numbers of place marks is easy. This keeps everything very manageable and scalable.

The downside is it leaves out many of the most interesting datasets potentially available, because they are large and complex - sitting on the short tail. Another popular Web 2.0 meme is that “data is the Intel inside” - positing that large complex data sets are one of the key differentiators on the Web. So, it would seem in this case that the focus on the “long tail” and positioning “data as the Intel inside” are in conflict. This also may be another indicator of where the semantic web (or whatever you want to call the next evolution of things) diverges distinctly from Web 2.0. Until the GeoWeb can solve the problem of dealing with large complex datasets, I think it will be difficult to answer deeper questions for users in ways that create substantive value.

Talking with Lior and Dave after the workshop, we agreed it was a tough problem, but one with big potential if solved well. Dave also brought up the thorny issue of how you know you are answering questions correctly. That is another can of worms that will have to wait for another blog post, but will be hugely important as things evolve.

As a side note, apologies to everyone for the issues we’ve been having with the date on the blog. Our virtual machine decided it wanted to peer into the future and run its system clock faster than reality. Looks like we have it fixed, but it blew away this blog post and several recent comments. I’ve done my best at rewriting this one, but sadly it looks like we’ve lost the comments. Fortunately most of them were letting us know the date had done gone crazy. On the upside, if anyone wants to know what the weather is going to be like this weekend or how the primaries will turn out, just let me know ;-)

Virtual Earth vs. Google MyMaps KML Support

April 26th, 2008 by Sean Gorman

As we’ve been putting GeoCommons through its paces I’ve been testing KML files we generate in different applications. The most interesting comparison by far has been between Virtual Earth and Google MyMaps. I did a high level comparison of the two plus Yahoo! MapMixer a few blog posts back, but after testing several KML files in each I thought it would make for a good follow up. Especially after Michael Jones’ comments to James Fee’s post about KML being the HTML of the GeoWeb.

The good news is that both Virtual Earth and Google Maps support KML, and we are seeing a greater number of applications supporting it and GeoRSS as GeoWeb standards. As the standards get picked up it will be interesting to see how they are supported and how applications differentiate themselves in doing so. Already we can see this beginning between the two titans (Microsoft and Google) expressing how their support of KML has advantages over the other. So, I thought I’d share what our experience was testing with both applications.

Google KML Support

For testing purposes I started off with a polygon data set of the 100 most polluted counties in the United States. The upload process for Google MyMaps was straightforward, and my uploaded KML (or GeoRSS) file prepopulated a title and description field. Then, after a bit of chugging, MyMaps rendered the KML file on the map. You can see the map I created embedded below:


View Larger Map

If you look closely you’ll notice that there are not 100 counties on the map (only about 44). Google MyMaps will support 200 pushpins on a map, but when you add in complex polygons the number of polygons and associated pushpins it will support goes down significantly. The MyMaps application gets around this problem by paginating the KML file into multiple maps, each supporting the maximum number of pushpins, lines or polygons. Unfortunately you can only embed one map page at a time, so the map above only shows the first set of polygons.

The Microsoft blog post about KML support made an interesting observation: “on Google Maps the polygons representing the parks didn’t load at all”. Our KML rendered the polygons fine, but we took an extra step in GeoCommons to generate our polygons as multigeometries, where a pushpin with the data is included inside the polygon and highlights when you mouse over (at least in Google Earth). So, my hunch is that in order to get polygon KML to render in Google MyMaps you need to structure it as a multigeometry, or they’ve added the functionality since then. It would be great not to have to add the pushpin to get the data, and to enable clickable polygons in both Google Earth and Google Maps.
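For reference, the multigeometry structure described above looks roughly like the output of this Python sketch, which wraps a pushpin point and a polygon in a single KML Placemark using only the standard library. The county name, description, coordinates, and function name are placeholders, and this is a generic illustration rather than the actual GeoCommons export code.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def multigeometry_placemark(name, description, ring, pin):
    """Build a Placemark whose MultiGeometry holds a Point plus a Polygon.

    ring: list of (lon, lat) tuples, first and last identical (closed ring).
    pin:  (lon, lat) for the pushpin placed inside the polygon.
    """
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = name
    ET.SubElement(placemark, "description").text = description
    multi = ET.SubElement(placemark, "MultiGeometry")

    # The pushpin that carries the attribute data.
    point = ET.SubElement(multi, "Point")
    ET.SubElement(point, "coordinates").text = f"{pin[0]},{pin[1]},0"

    # The polygon outline; KML coordinates are lon,lat,alt triples.
    polygon = ET.SubElement(multi, "Polygon")
    outer = ET.SubElement(polygon, "outerBoundaryIs")
    linear_ring = ET.SubElement(outer, "LinearRing")
    ET.SubElement(linear_ring, "coordinates").text = " ".join(
        f"{lon},{lat},0" for lon, lat in ring
    )
    return placemark


kml = ET.Element("kml", xmlns=KML_NS)
document = ET.SubElement(kml, "Document")
ring = [(-95.0, 29.0), (-94.5, 29.0), (-94.5, 29.5), (-95.0, 29.5), (-95.0, 29.0)]
document.append(
    multigeometry_placemark("Example County", "Placeholder attributes", ring, (-94.75, 29.25))
)
kml_text = ET.tostring(kml, encoding="unicode")
print(kml_text)
```

A multi-polygon county would simply add more Polygon elements inside the same MultiGeometry.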

On the plus side, Google MyMaps does a good job handling multi-polygons. A multi-polygon is when you have multiple polygons representing one geographic entity. For instance, the United States of America consists of several separate polygons, including Alaska, the Hawaiian Islands, and the contiguous states. Several of the counties in our test data set had multi-polygons and you can see those rendered in detail in the embedded map below:


View Larger Map

A second plus for Google MyMaps is balloon support for the data that shows all the attributes in a nicely parsed list. Even when I loaded up a census data set with 74 attributes it listed them all out with a scroll bar. So to recap:

Advantages = prepopulated title and description, quick load, multi-polygon support, full listing of data attributes.

Disadvantages = limited number of polygons rendered on one map, requires multigeometry KML to support clickable polygons, slow rendering of polygons, no ability to export KML or other standard.

Virtual Earth KML Support

Virtual Earth KML support is provided through the “Collections” feature. When you click “Import Collection” you are given the option to add a KML file (or GeoRSS or GPX). I uploaded the same county pollution file and Virtual Earth chugged along for a bit, then gave me a message saying, “100 out 100 items uploaded”. I’ve tried this with other files, and if a file has more than 200 features it will not upload all of them - just the first 200, then it stops. Also, if your KML file is over 2 MB it will tell you it is too large. Overall this is a nice feature that lets you know the bounds of the system and what will work and what will not.
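Those limits are easy to check before uploading. Here is a rough Python sketch that counts the Placemark elements in a KML document and compares the count and the file size against the 200 feature and 2 MB ceilings mentioned above. The limits are only what showed up in testing, not documented values, the 2 MB figure is assumed to mean binary megabytes, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

MAX_FEATURES = 200               # observed in testing, not an official limit
MAX_BYTES = 2 * 1024 * 1024      # "2 MB", assuming binary megabytes


def check_kml_limits(kml_bytes, max_features=MAX_FEATURES, max_bytes=MAX_BYTES):
    """Return a list of warnings for a KML document given as bytes."""
    warnings = []
    if len(kml_bytes) > max_bytes:
        warnings.append(
            f"file is {len(kml_bytes)} bytes, over the {max_bytes} byte limit"
        )
    root = ET.fromstring(kml_bytes)
    # Match Placemark regardless of which KML namespace the file declares.
    placemarks = [
        el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "Placemark"
    ]
    if len(placemarks) > max_features:
        warnings.append(
            f"{len(placemarks)} features, only the first {max_features} would load"
        )
    return warnings


# Tiny synthetic document with 250 empty placemarks.
sample = (
    b'<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
    + b"<Placemark><name>a</name></Placemark>" * 250
    + b"</Document></kml>"
)
print(check_kml_limits(sample))
```

For a real file you would read the bytes with `open(path, "rb").read()` and pass them in the same way.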

The second nice part is that all 100 counties made it on one map instead of just 44 as with Google. A second bonus was that Virtual Earth did not need the multigeometries to support the clickable polygons rendered on the map. In fact the multigeometries we included in our KML generation caused both a pushpin to be drawn and a second square that gets highlighted when you mouse over the polygon. You can check out the map here and see the screen shot below:

MSVE_polygons

Sadly Virtual Earth does not support embeds, so just the screen shot and link. Another small ding, as you can see in the screen shot, is that Virtual Earth does not support multi-polygons. The spots where you see push pins instead of polygons indicate counties represented by multiple polygons, like Galveston, that could not be rendered, so a push pin was placed there instead. It still gets the job done, but there is something dissatisfying about America’s or any other political unit’s borders being replaced by a push pin. The last complaint is that Virtual Earth only supports a limited number of characters for attributes, so when I tested a census file with 74 attributes I only got the first twenty or so and they were not well formatted. So to recap:

Advantages = ability to render more polygons, ability to render polygons faster, ability to support clickable polygons without multigeometries, ability to export KML (and other formats)

Disadvantages = inability to support multi-polygons, slow to load KML, limited support of data attributes, no support of balloon styling

Overall I would give a slight edge to Virtual Earth when it comes to KML support from our unique perspective. Specifically, the ability to load a larger number of polygons on a map and make those easily clickable allows more of our content to be leveraged at this point. It will be interesting to see how Google, Microsoft and others continue to enhance KML support to make more data available. I believe there is still a long way to go, and the vast majority of the datasets in GeoCommons are too large for either to handle at this point. As the GeoWeb and the data it interconnects become more sophisticated, I think it will be a necessity to greatly increase the amount and complexity of data that can be handled in a browser-based map. Hopefully the market pushes Microsoft, Google and others to innovate in that direction.

GeoCommons Metadata Implementation Screenshots

April 22nd, 2008 by Sean Gorman

We got such useful feedback from the last metadata post I thought I would add some screen shots of how it is starting to come together. Unfortunately we were not able to get all the suggestions in because of the time crunch hitting our release date, but please keep posting the feedback and we’ll work it in as we have more time.

The first screen shot is of the data details page, which contains the metadata information for the data set. In this case 2000 US Census data at the tract level for Alabama:

finder_data_page

Here you can see the major elements we are capturing in a user-friendly graphical layout. One of the cool new bits is that the system automatically calculates statistics when you upload the data. Being able to data mine and run statistics on the fly is one of the new developments we are particularly excited about.

All the metadata on the data details page is exposed as Dublin Core elements which should make them machine readable to the rest of the world:

finder_view_source
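One common way to expose Dublin Core in a web page is as meta tags in the HTML head, with each element name prefixed by `DC.`. The Python sketch below shows that convention; the field values are placeholders and this is a generic illustration of the pattern rather than the markup Finder! actually emits.

```python
from html import escape


def dublin_core_meta(record):
    """Render a dict of Dublin Core elements as HTML meta tags."""
    # The link tag declares which schema the "DC." prefix refers to.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        name = escape(element, quote=True)
        content = escape(str(value), quote=True)
        lines.append(f'<meta name="DC.{name}" content="{content}">')
    return "\n".join(lines)


# Placeholder metadata record; a real one would come from the data set.
record = {
    "title": "2000 US Census data, tract level, Alabama",
    "creator": "Placeholder creator",
    "format": "application/vnd.google-earth.kml+xml",
    "date": "2008-04-22",
}
print(dublin_core_meta(record))
```

Because the tags live in the page source, a crawler can pick up the title, creator, and date without having to parse the visible layout.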

Also there are links to FGDC and ISO 19115 metadata mappings which take you to simple text pages with the indicated information. We probably need another pass to get these completely correct, but the infrastructure is all in place to do so.

FGDC looks like this:

Finder_FGDC

ISO 19115 looks like this:

Finder_ISO

Hopefully this will help make the data in GeoCommons useful to multiple geospatial work flows. We hope having the ability to get data out in shapefile, KML, and .CSV (spreadsheets) will create more cross fertilization between GeoWeb and GIS users. With some luck it can help get more geospatial data out to the public that has been difficult to access in the past. A couple of examples below.

US Census Tract Data for Alabama

Alabama Census Tract

Global Maritime Shipping Lanes

Global Shipping Lanes

Zillow Neighborhoods and Shipping Lanes (just because it looked kinda cool)

SF_neighborhoods

Thanks again for the feedback from folks on the metadata and we’ll keep iterating on getting it spot on.

Links List 4.18.08

April 18th, 2008 by Sean Gorman

Moxie designed a demonstration to show how an integrated geospatial service, RIA technology, location-based services and digital maps can make life easier. From that, a geospatial service was developed that enabled a Flex Yahoo! AS3 map application.

Virtual Earth has been updated to include new imagery, new 3D buildings, direct support for MapCruncher, movie capture, export to KML and GPX files, and more.

Geomantic shares some coverage of geospatial topics in the Washington Post the past week including a story on Yahoo! Maps Live and a mashup from the Center for Neighborhood Technologies.

GISLounge and the Daily ACK announced that KML is now an Open Geospatial Consortium standard. This means that Google will no longer be responsible for maintaining the KML file format, which will instead be handled by OGC. KML (Keyhole Markup Language) is a file format that uses an XML-based language to manage geographic information.

Flowing Data provides a list of data visualization blogs you may not know about, including Strange Maps, Well-formed Data, Random Etc, Serial Consign, and AnyGeo.
