Whenever we present GeoCommons, there are always questions about the accuracy and validity of crowdsourced data. The standard answer has been that the data is only as good as its source, and we provide multiple levels of citation to clearly identify that source. Sometimes the source is an individual who created their own data, and there is no citation other than that Bob made a spreadsheet or took his GPS out on the town. More frequently the data comes from an existing source like the OECD, the United Nations, or the US Dept. of Transportation, and there is a link back to the source URL where the data was found. Lastly, there is GIS data that has a full metadata specification (FGDC or ISO 19115), which can be included as a link.

While this information is all available on any metadata page in Finder, nothing really covers whether the data has been quality checked. One of the dirty secrets of all data is that it inherently contains errors and mistakes. If anyone tells you their data is perfect, they are most likely fibbing, and also believe their armpits never smell.

The challenge of data accuracy was reinforced recently by two different blog posts where readers identified errors on maps that we posted. One was a map our data team created on “College Coaches Salaries”, which had geocoding errors, and the second was Steve Chilton’s OSM coverage map, which had Monaco in place of Munich.

If you’ve spent a lot of time with geospatial data, you’ll know these errors happen quite easily. Geocoding software produces errors frequently, and it is easy to overlook a misplaced city name when going over hundreds of columns. I’ve been thinking about how we can introduce better quality assurance into the data we contribute and also help users of GeoCommons identify issues in their shared data.

For inspiration I looked into two existing projects: Wikipedia and Swivel. Wikipedia probably has the most advanced quality assurance mechanisms in place for a crowdsourced project, but it is focused on text. Swivel, on the other hand, deals directly with data, although not geospatial data.

One of the most useful approaches I’ve seen in Wikipedia is a common set of icons for labeling articles that have issues (no citation, too long, reads biased, needs verification, etc.). With the icons and text I can quickly see issues that exist with an article, which can help me gauge the extent to which I should trust the text. While the Wikipedia taxonomy is quite thorough it is geared around articles and not geospatial data.

One of the great things about data is that many organizations release it into the public domain, so copying data does not have the same issues that copying text has (plagiarism). This provides the opportunity to have data come directly from an “official source”. Swivel had the great idea of formalizing this by creating partnerships with organizations to share their data with the community as an “official source”. This again helps users decide on the level of confidence they have in a particular data set.

So my conclusion, after spending some time looking at both, is that a set of icons and labels that tell users how well a dataset has been vetted could be useful, especially when combined with clear labeling of a data set as “official source” or transcribed by someone else. Here are a few possible labels and icons for data.

Geocoding Error

Needs Citation

Data Needs Cleanup

Data has been QA’d by an Editor

Then there are the icons that Swivel has created for “official source” data managed by Swivel and “official source” data uploaded by the source organization.

Official Source

Official Source, Managed by Swivel

These are the tags that seemed to be most relevant. Are there other tags that folks think would be useful, or does anyone see issues with these? If there is general consensus around labels and icons to tag the level of validation for a given data set, they could be used by anyone who has a need.

The other bit of this that I think is critical is creating a feedback loop to identify what the issues are with a particular data set. That opens the question of whether these should be georeferenced annotations indicating where on the map the error is, comments on the metadata page explaining the problem, or a combination of both. This requires a bit more engineering effort than the icons, but my first take is that a combination of the two could work well. Any other suggestions out there for providing better QA on geospatial data? Would love to hear them.
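
To make the idea a bit more concrete, here is a minimal sketch in Python of how the labels and the feedback loop could fit together. None of this is an existing GeoCommons or Finder API: the names `QALabel`, `Annotation`, and `DatasetQA` are hypothetical, and the example dataset and coordinates are purely illustrative. An annotation with a location stands for a georeferenced note on the map, and one without a location stands for a comment on the metadata page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class QALabel(Enum):
    """Hypothetical label set, taken from the labels proposed in this post."""
    GEOCODING_ERROR = "Geocoding Error"
    NEEDS_CITATION = "Needs Citation"
    NEEDS_CLEANUP = "Data Needs Cleanup"
    QA_BY_EDITOR = "Data has been QA'd by an Editor"


@dataclass
class Annotation:
    """One piece of feedback about a dataset."""
    comment: str
    label: QALabel
    # (longitude, latitude) when pinned to the map; None for a metadata-page comment
    location: Optional[Tuple[float, float]] = None

    @property
    def is_georeferenced(self) -> bool:
        return self.location is not None


@dataclass
class DatasetQA:
    """QA state for one dataset: its source status plus all feedback on it."""
    title: str
    official_source: bool = False
    annotations: List[Annotation] = field(default_factory=list)

    def labels(self) -> List[str]:
        """Return the distinct label names in the order they were first reported."""
        seen: List[str] = []
        for annotation in self.annotations:
            if annotation.label.value not in seen:
                seen.append(annotation.label.value)
        return seen


# Illustrative usage with made-up feedback.
coverage_map = DatasetQA(title="Example coverage map")
coverage_map.annotations.append(
    Annotation("City name looks wrong here", QALabel.NEEDS_CLEANUP, location=(11.58, 48.14))
)
coverage_map.annotations.append(
    Annotation("No link back to the source URL", QALabel.NEEDS_CITATION)
)
print(coverage_map.labels())
```

Keeping both kinds of feedback in one structure means the icons shown for a dataset can be derived from the annotations themselves rather than maintained separately.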

The Election 2.0: Post-Election Data and Analysis

November 6th, 2008 by Bill Greer

The elections are over and Barack Obama won. Aside from the historic nature of electing our first African-American President, this election was also historic for its voter turnout and for the technology that helped make it possible and entertaining. It was surrounded by new technologies and innovation, including holograms from CNN, get-out-the-vote drives on Facebook, Pandora (which had a find-your-polling-place widget), and other social media platforms. By far the coolest election application was the Twitter Vote Report. This service allowed users to tweet information on the go from their polling locations, reporting wait times, rating the polling location, and giving other quality indicators, all of which was fed into a live map. FortiusOne joined in on the fun by making as much election data public and available as possible. Here are some examples of our political/election datasets and maps, including the full Twitter results:

Twitter Voter report (end of day), USA, Nov. 4 2008

Polling locations, Maine, 2008
2008 Polls vs. 2004 Election Results with Socio-Economic Indicators, USA, 2008
Voting Districts, New York, 2000
FEC, Individual donations to Obama campaign during August, 2008, USA, August 2008

The most interesting data was the Twitter Vote Report Data, so we thought we would try to run a little analysis on where the Tweets were coming from and where the wait times were longer.



The first map shows counts of tweets, which seem to correlate strongly with the high-tech corridors in the region.

The second is pretty interesting: it really shows a spatial divide in the region in terms of average wait time, with lower wait times in the western portion of the region and higher wait times in the eastern portion. Orange represents high values and blue low values.

Seeing that the wait times were longer on the eastern part of DC we decided to run some statistical analysis to see if the longer wait times were correlated with race and ethnicity.

The numbers in the tables are correlation coefficients. A correlation coefficient has a possible range of -1 to 1, and values closer to 1 or -1 indicate stronger associations between the variables. Negative coefficients indicate a negative association and positive coefficients indicate a positive association. So, the percent black field has a correlation coefficient of around 0.15, which means that counties with higher percentages of black residents tended to have higher average wait times. This is only a slight correlation, but it’s there.
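
For anyone who wants to run this kind of check on their own data, here is a minimal sketch of computing a correlation coefficient in plain Python. It assumes the Pearson coefficient, which is a common choice but not necessarily the exact statistic behind the tables in this post, and the function name `pearson_r` and the sample numbers are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Made-up values per county: a demographic percentage and an average wait time in minutes.
percent_values = [10, 20, 30, 40]
avg_wait_minutes = [5, 7, 6, 9]
print(round(pearson_r(percent_values, avg_wait_minutes), 2))
```

A result near 0 means little linear association, while a result near 1 or -1 means the two columns move together or in opposite directions almost in lockstep.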

Here are the correlations between Twitter users and race

Keep an eye out: we’ll be updating the post-election data with full results, voter turnout, and other interesting tidbits. Feel free to let us know what you would be interested in seeing in Finder!

Links List 10.24.08

October 24th, 2008 by Sean Gorman

Ogle Earth shares a plethora of links with everything from a 3-D globe viewer from Microsoft Virtual Earth’s API to heatmaps of georeferenced Panoramio photos to a job search using ReliefWeb’s map of humanitarian vacancies. It really shows that you can use a map for anything.

Reverse geocoding for Google Maps is now available, and Google Maps Mania has a comprehensive review. Reverse geocoding is pretty cool: you enter the latitude and longitude of a location and it returns the physical address (for example, FortiusOne’s mailing/street address).

Journalists take note. The AnyGeo Blog points out how important the visual of a map is in telling a story. Reading a recent article in the local Fort Collins, CO paper, Glenn says, “I can’t help to think how much more useful the article in the paper would have been by simply posting the actual map or a link and forget about all the blabber.”

The Click2Map blog gives an overview of, and insight into, the Google Gears Geolocation API for laptop wi-fi users. The original intent of the Gears Geolocation API was to let developers easily deliver location-enabled web sites on mobile phones. But the team realized that laptop users could benefit as well, so they added that functionality to the product. Even better, the Gears Geolocation API is free.

Links List 10.10.08

October 10th, 2008 by Sean Gorman

Adena at Directions Magazine shared the Mozilla announcement that Geode is coming. Geode is a geolocation add-on for Firefox that will enable localized content. ReadWriteWeb describes it as a tool that “understands location, enabling enriched, personalized, and localized content” and VentureBeat explains that it’s a location determination tool, built on the W3C spec, upon which developers can build. There are still many questions about the exact capabilities of Geode, but it looks like it could be an interesting tool for your browser.

SlashGeo talks about the importance of GeoPresence, based on a piece by Ron Lake of Galdos, Inc. Ron said, “…a GeoPresence might be thought of as a visual and behavioural representative for yourself or your organization, not in a complete world of fantasy such as Second Life, but in some sort of approximation of the real world, the Virtual World. Furthermore, we can expect that this GeoPresence will reflect you or your organization more or less in real time.”

Karen Siderelis was named the first geospatial information officer (GIO) for the Department of the Interior. Siderelis will guide the Federal Geographic Data Committee (FGDC), which coordinates the federal government’s GIS activities to provide information to people.

The MetaCarta Public Sector User Group established geotagging crime reports as one of the key applications realized by public safety organizations at their meeting yesterday. They highlighted the North Texas Fusion Centers (NTFC) as an example of how police were able to detect cross-border weapons along geographic corridors of the Texas-Mexico border by geotagging the reports to see how crime travels.

ITT released its first color, half-meter ground-resolution image taken from the GeoEye-1 satellite. Check out the fusion image ‘created from blending the 0.41m panchromatic image and the 1.65m color image.’
