Ethics of Crowdsourcing - What Constitutes an Abuse of the Commons
July 29th, 2008 by Sean Gorman
While getting ready to launch Finder! we had an internal debate about whether to put limits on dataset downloading. There were several options, ranging from requiring users to be logged in before they could download to limiting the number of downloads a user could make in a day. A lot of the argument centered on the value of raw data, echoing the O’Reilly manifesto that “data is the Intel inside”. This belief holds that the value of the NAVTEQs and TeleAtlases of the world is derived from the proprietary data they have collected.
One side of the company felt that by not limiting access to data we were giving away the family jewels. The other side felt that open access was the best way to create a network effect for data by making it as accessible as possible. At the end of the day the open access philosophy prevailed, and from the sound of comments to James Fee’s post after GeoWeb, access to data is still an important facet to both GIS and GeoWeb users.
Now that Finder! has been out for a little while we’ve begun to see a big surge in downloads. I noted last week we hit 18,000 downloads and just a week later we are now over 28,000. This has caused us to take a second look at our access policies. “Knock on wood”, the system has scaled like a champ handling the traffic, but as we get ready to launch Maker! some concerns have come up about potential abuse and its effect on user experience.
The biggest concern is systematic downloading of data and its potential to degrade other users’ experience on the site. The question is how to make the content available without impinging on the collective user experience. Wikipedia approaches this by making content available as one big tarball and asking users: “Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia. Our robots.txt blocks many ill-behaved bots.”
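For anyone writing a bot, honoring a site's robots.txt is the baseline for good behavior. Here is a minimal Python sketch of that check using the standard library's `urllib.robotparser`. The rules, the `/download/` path, the bot name, and the `may_fetch` helper are all made up for illustration; they are not the contents of Wikipedia's or our robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not any real site's robots.txt.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /download/",
]


def may_fetch(user_agent, url, rules=EXAMPLE_RULES):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse rules held in memory instead of fetching them
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite bot checks before every request and skips what is disallowed.
    for url in ("https://example.com/download/1234.kml",
                "https://example.com/about"):
        print(url, may_fetch("example-bot", url))
```

A real crawler would load the live file with `set_url()` and `read()` rather than a hard-coded list, and would also pause between requests.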
I’m not sure a giant tarball of data is the best way to go for us, especially since the data is available in a variety of formats. A second option is to provide third-party access to the data via an API. This API could work for both download and upload. Andrei had an interesting suggestion in our last post:
“The two-way API will definitely help with the number of uploads. The cool thing to do would be to add (“Add to Finder!”) a URL request:
…finder.com/add?file=file.kml&type=kml&name…”
If people have other ideas on how they could better access the data in bulk without hurting performance, we’d love to hear them. We’d also like thoughts on where the line falls between fair use of content and abuse of the commons. It is a bit of a gray line in my mind. Is systematic downloading (manually hitting every dataset) abusive? Is scraping datasets with bots abusive? The main goal in my mind is to provide the best service possible without creating a “tragedy of the commons”.
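To make one of the options from our internal debate concrete, a cap on the number of downloads a user can make in a day, here is a rough Python sketch. It is not how Finder! works: the `DailyDownloadQuota` class, the in-memory counter, and the limit of 2 in the usage lines are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class DailyDownloadQuota:
    """Hypothetical per-user daily download cap, counted in memory."""

    def __init__(self, limit_per_day: int):
        self.limit_per_day = limit_per_day
        # Maps (user_id, calendar day) to the number of downloads granted.
        self._counts = defaultdict(int)

    def try_download(self, user_id: str, today: Optional[date] = None) -> bool:
        """Record one download and return True, or return False if over the cap."""
        day = today or date.today()
        key = (user_id, day)
        if self._counts[key] >= self.limit_per_day:
            return False
        self._counts[key] += 1
        return True


if __name__ == "__main__":
    quota = DailyDownloadQuota(limit_per_day=2)
    day = date(2008, 7, 29)
    # The third request on the same day is refused.
    print([quota.try_download("alice", day) for _ in range(3)])
```

Even a sketch like this shows the trade-off: a hard cap is easy to enforce, but it treats a curious user browsing many datasets the same way as a bot scraping the site.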

November 5th, 2008 at 9:03 pm
Interestingly, even for accountants :)))))