Dataset of the Day: Hockey, Getting Fans in the Seats
April 24th, 2009 by Kevin Burke
The 2008-2009 NHL season has been a thrilling one, and it continues to be with the start of the playoffs. The game’s popularity has been growing, and attendance has risen as a direct result. The league’s total attendance record was broken this year for the fourth consecutive year. This news made me want to take a closer look at the data.
I first went to espn.com and looked at attendance figures from the 2008-2009 season. After looking over the stats, I saw that some teams had regular sellouts while other teams struggled to fill the seats. The map below shows the percentage of seats that were filled throughout the season for each team.
Why did some teams sell out every game while others showed poor attendance? I decided to investigate by using Finder! and Maker! to run correlations to determine why a team could or could not get fans in the arena.
The first thing I wanted to correlate was a team’s finishing place in the league and its attendance as a percentage of capacity for the season. A common theme in sports is that fans only go to watch a team if that team is winning. I mean, who wants to go see the last place team in the league play?
The correlation shows some interesting results. It appears that where your team finishes does not always affect the number of fans you put in the seats. The correlation between the two factors was only .48 (strong correlations are values close to 1 or -1). For example, the Ottawa Senators filled 105% of their seats during the year, yet they finished 22nd out of thirty teams in the league. Meanwhile, the Carolina Hurricanes, who finished 11th out of thirty teams, filled only 88.5% of their seats (10th worst in the league).
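For readers who want to reproduce this kind of check outside of Maker!, here is a minimal sketch of a Pearson correlation in plain Python. The function name `pearson_r` is my own, and the sample lists are illustrative only: the 11th/88.5% and 22nd/105% pairs come from the paragraph above, while the other three pairs are made up and are not the ESPN figures.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Illustrative sample: only the 11th and 22nd place rows come from the post;
# the remaining rows are invented to make the example runnable.
finishing_place = [1, 5, 11, 22, 30]
percent_full = [100.0, 97.0, 88.5, 105.0, 80.0]

r = pearson_r(finishing_place, percent_full)
print(f"r = {r:.2f}")
```

Note that the sign of the result depends on how finishing place is coded: if 1 means best, a "winning fills seats" pattern shows up as a negative value, so it is the distance from zero that matters.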
Next I ran some other correlations to see if any other factors helped get people into the seats. Below is what I tried.
- Number of Consecutive Playoff or Non-Playoff Seasons (shows if a team has been continuously successful or unsuccessful)
- Unemployment % for February 2009 (If you’re broke and without a job, you probably won’t be spending your money to go to a hockey game)
- Average Temperature During Hockey Season (Hockey is a sport that is heavily followed in colder climates)
None of the correlations fared much better. Surprisingly, Average Temperature During Hockey Season was the strongest (-.59). This led me to the conclusion that it is a combination of different factors that determines whether a team is able to get people in the seats for its games. So I took several factors, gave them specific values, and combined them to come up with “The Kev Score”. I am hoping that “The Kev Score” will show how certain factors combined determine whether an NHL team will reach its maximum attendance capacity.
Here is how I computed “The Kev Score”
Factors:
- Finishing Place (if in 1st place = 30 points, 2nd = 29 points, and so on)
- Temperature (Coldest City = 30 points, 2nd Coldest City = 29 points, and so on)
- Canada Factor (if a Canadian team you get 15 points added to your score)
- USA Hockey IQ Factor (whether a USA city is known as a hockey town)
o Good IQ (10 points added)
o Poor IQ (No points)
- City Population (Highest City Population = 30 points, 2nd Highest City Population = 29 points, and so on)
The Formula:
Finishing Place Points + Temperature Points + Canada Factor + Good USA Hockey IQ Factor + City Population Points = “The Kev Score”
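The formula above can be sketched in a few lines of Python. This is only my reading of the scoring rules: the function and parameter names are hypothetical, and I am assuming that the Hockey IQ bonus applies only to USA teams (Canadian teams get the Canada Factor instead), which the factor list implies but does not state outright.

```python
NUM_TEAMS = 30

def rank_points(rank, num_teams=NUM_TEAMS):
    """1st earns num_teams points, 2nd earns one fewer, and so on down to 1."""
    if not 1 <= rank <= num_teams:
        raise ValueError(f"rank must be between 1 and {num_teams}")
    return num_teams + 1 - rank

def kev_score(finish_rank, cold_rank, population_rank, is_canadian, good_hockey_iq):
    """Combine the five factors into a single score.

    finish_rank:     1 = first place in the league
    cold_rank:       1 = coldest city
    population_rank: 1 = highest city population
    """
    score = (rank_points(finish_rank)
             + rank_points(cold_rank)
             + rank_points(population_rank))
    if is_canadian:
        score += 15          # Canada Factor
    elif good_hockey_iq:
        score += 10          # USA Hockey IQ Factor (assumed USA teams only)
    return score

# A first-place Canadian team in the coldest, most populous city:
print(kev_score(1, 1, 1, is_canadian=True, good_hockey_iq=False))  # 105
```

Under these assumptions the score runs from 3 (last in every ranking, no bonus) to 105.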
The correlation between the Arena Full Capacity Percentage and the “Kev Score” is reasonably high at .81. So is the “Kev Score” a reliable way to predict how to get fans in the seats? I decided to use the formula again, this time testing it with statistics from the 2007-2008 season. Here is what happened.
At a much lower correlation of .60 it seems that the “Kev Score” does not prove itself to be a strong indicator of fan attendance for the 07-08 season.
Was “The Kev Score” a reliable way to judge whether a team would have strong attendance? Well, not really, but it worked better than all the other things I tried. See if you are able to discover your own “Kev Score” and help hockey team owners around the NHL discover how to bring more fans to their games.
Dataset of the Day: Mega Millions!!!!
March 3rd, 2009 by Kevin Burke
As I was driving into work today and listening to the radio I heard some interesting news. The Mega Millions drawing for tonight is up to a value of $212 million. Wow! With a ticket only costing $1 I have decided to go out and buy one. Will I end up winning? Ha, fat chance, but why not?
In honor of the high amount of prize money I have decided to put together a few lottery maps by using Finder! and Maker! I’m hoping that this work and this blog will in some way shower me with good luck/good karma and reward me with the lucky jackpot numbers. We shall see …
My first map is of states that participate in the Mega Millions contest. It is in these states where you may purchase a Mega Millions ticket (or two).
http://maker.geocommons.com/maps/3361
The second map is the locations of previous winners.
http://maker.geocommons.com/maps/3364
And the third map is of state lottery sales in 2006.
http://maker.geocommons.com/maps/3365
Now with all this work I am sure to win the lottery. I believe the rest of the day I will plan on how to spend my millions. Ahahahahaha.
Dataset of the Day: Stimulus Projects and Unemployment
February 9th, 2009 by Emily Sciarillo
Everyone is keeping their eye on what will happen with Obama’s stimulus package. When it does pass, Obama pledges full “transparency,” so that “citizens can see how and where their tax dollars are being spent.” So as citizens, how can we best evaluate the appropriateness and effectiveness of projects that will be candidates for stimulus funding?
To help us, stimuluswatch.org has set up a site dedicated to helping “the new administration keep its pledge to invest stimulus money smartly, and to hold public officials to account for the taxpayer money they spend.” They provide a database of “proposed ‘shovel-ready’ projects” throughout the country which will be candidates for federal grant money as part of the stimulus package. The site offers the capability for citizens to view the proposals and decide if they think they are critical or not.
In order to help viewers better assess the appropriateness of these projects, we uploaded the data to Finder! and then used Maker! to compare where these projects will be and where jobs are most needed.
In the map below, we show the projects by the number of jobs that will be created. The larger circles are where more jobs will be created. We also show the change in unemployment by county between November of 2007 and November of 2008. The blue counties are where there was a decrease in unemployment, the white where there was a fairly small increase, and the yellow and orange areas show larger increases.
Taking a look at the country as a whole, it does seem that many of the projects are proposed in areas that have suffered job losses. This is particularly true for areas of Southern California, Florida and the Rust Belt. Areas in the center of the country, where there have been some decreases in unemployment, have fewer proposals for job-creating projects.
Let’s look more closely at one area to examine how the proposed projects match up with job losses. Georgia is one area that seems to have experienced a heavy loss of jobs over the past year.
You can see in the map above that there are many clusters of counties in Georgia whose unemployment rate has increased by more than five percentage points. None of these counties has a project planned in the direct vicinity. Hancock County, Georgia, has had the highest increase in unemployment and the third highest unemployment rate for this November of all the counties in the US. In November of 2007 its unemployment rate was 9.2%, and in November of 2008 the rate reached 20.1%, an increase of 10.9 percentage points. The nearest proposed projects to Hancock are either an hour and a half away in Macon or an hour and forty minutes away in Conyers.
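A change in an unemployment rate can be stated two ways, and they are easy to confuse. The short sketch below works through the Hancock figures quoted above; the function names are my own, chosen for illustration.

```python
def point_change(old_rate, new_rate):
    """Difference between two rates, in percentage points."""
    return new_rate - old_rate

def relative_change(old_rate, new_rate):
    """Change relative to the starting rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100

# Hancock County unemployment rates quoted in the post.
nov_2007, nov_2008 = 9.2, 20.1

print(round(point_change(nov_2007, nov_2008), 1))     # 10.9 percentage points
print(round(relative_change(nov_2007, nov_2008), 1))  # 118.5 percent
```

So the 10.9 figure is the gap between the two rates; relative to where it started, the rate more than doubled.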
While the governor of Georgia may have good reasons for creating jobs in the proposed areas, it leaves one to wonder what will become of places such as Hancock that have suffered the most in this economic crisis.
Take a look at this map yourself in Maker!. You can zoom in to areas you are interested in and decide for yourself the validity of these projects.
On the other hand, it is interesting that Illinois is fairly well represented here. Of the 891 projects in the country, 119, or about 13.4% of them, are in Illinois. While Illinois does have some yellow and orange counties, it is by no means the hardest hit state in the country in terms of unemployment. Does the state expect some favoritism from the new president?
On closer inspection, the 119 projects in Illinois will create significantly fewer jobs than projects in other states. California, which faced the fourth highest unemployment rate in November, is proposing 93 projects which will produce 238,329 jobs.
The chart below lists the 16 states with the highest unemployment rates in November, along with the number of projects proposed in each state, the total number of jobs, and the number of jobs per 1,000 people those projects will create.
States like Michigan and South Carolina, which need jobs the most, are proposing projects that will create comparatively few jobs per capita. You can download a CSV of this dataset from Finder! and do your own analysis of the proposed projects.
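If you do download the CSV, the two ratios used in this comparison are simple to compute. Here is a rough sketch; the function names are hypothetical, the California figures are the ones quoted above, and the population in the second example is a placeholder rather than a real state population.

```python
def jobs_per_project(total_jobs, num_projects):
    """Average number of jobs created per proposed project."""
    return total_jobs / num_projects

def jobs_per_thousand(total_jobs, population):
    """Jobs created per 1,000 residents."""
    return total_jobs * 1000 / population

# California, using the figures quoted in the post.
print(round(jobs_per_project(238_329, 93), 1))  # 2562.7

# Placeholder numbers only, to show the per-capita calculation.
print(jobs_per_thousand(5_000, 1_000_000))      # 5.0
```

Normalizing by population is what makes states of very different sizes comparable in the chart.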
We can also look at the projects compared to state unemployment rates, as is seen in the map below. The yellow and orange states are the ones shown in the graph above. To see this map click here.
Of course nobody is saying that unemployment rates should be the only criterion for where stimulus money should go. But if the package is going to truly address unemployment, projects that will add significant jobs to areas with high unemployment rates should be considered strong candidates for federal funding.
The Possibilities of Collective Statistical Intelligence
January 9th, 2009 by Sean Gorman
I was reading Kevin Burke’s post today on the relationship between political affiliation and charitable giving, and thought it was a great example of “collective statistical intelligence”. In the post Kevin runs a set of correlations between political affiliation and a generosity index, then posts the results.
While the post was fascinating and great content, the comments were even more engaging. There is a great discussion on the data used, how the results could be interpreted, and what some of the potential pitfalls are - like ecological fallacy. One of the most challenging aspects of doing a statistical analysis is interpreting the results. Running an analysis is fairly straightforward, but arriving at the right conclusion from that analysis can be quite challenging. Interpretation can go wrong because a user does not know the theory well enough or does not know the subject matter well enough (academically or through “on the ground” experience).
The response to Kevin’s post, I thought, really showed the potential of “crowdsourcing” better statistical intelligence. When you open up the results of an analysis as well as the data used to perform it, there is a great opportunity for real collaboration: the type of discussion and conjecture that can lead to better decisions with statistical data. Since this discussion all takes place within a connected platform (i.e. the Web), the results can be harnessed over time and mined to see trends and macro correlations that help validate findings.
If we think about the way this is done traditionally it revolves around academic peer review. I have a hypothesis (that variable “x” could be an explanation of phenomenon “y”). I read the literature to see if there is theory to back up my hypothesis. I look at other studies to see what variables they used to explain phenomenon “y”. Then I build my model, run my results, write up my findings and send them off in hopes of being published. The journal takes my paper and sends it to other academic experts and they critique my research based on their experience and the relevant literature in the field. If I do my job well the paper is published and those with access to the journal can consume my research and hopefully be informed by it.
The problem is this is a very long process - on the order of years. It can take over a year just to go through the submittal, peer review and publication process. So, while the approach is great for validating research and producing meaningful results, it is rarely done outside of academia in a rigorous way. What if that same process could be done in minutes/hours/days instead of years? We see a little bit of this in blogs every day - massively distributed peer review - but it is peer review of opinion 99% of the time. Kevin’s post showed something different: peer review of data. Not just reviewing “is the data accurate”, but “is the analysis of the data correct”. Over the course of a day the post received a really solid peer review of the analysis. To be honest, it is better than many of the peer reviews I’ve gotten from academic journals.
If we go the next step and begin to harness this analysis, making it discoverable for the next user who runs an analysis with political affiliation or charitable giving, it becomes yet more interesting. There are lots of directions this can go, and we would love to get people’s thoughts on what they would find useful. If you’ve used GeoCommons a bit, it is probably obvious that the scatter plot screen shots look awfully similar to the Maker user interface. That is no coincidence, and we hope to have more details on a whole new set of GeoCommons functionality here shortly - stay tuned.
Quality Assurance for Crowdsourced GeoData: Icons and Comments?
December 16th, 2008 by Sean Gorman
Whenever we present GeoCommons there are always questions about the accuracy and validity of crowdsourced data. The standard answer has been the data is as good as the source, and we provide multiple levels of citation to clearly identify the source. Sometimes the source is an individual who created their own data and there is no citation other than Bob made a spreadsheet or took his GPS out on the town. More frequently the data comes from an existing source like OECD, the United Nations, US Dept. of Transportation, etc. etc. and there is a link back to the source URL where the data was found. Lastly there is GIS data that has a full metadata specification (FGDC or ISO 19115) which can be included as a link.
While this information is all available on any metadata page in Finder, there is nothing that really covers whether the data has been quality checked. One of the dirty secrets of all data is that there are inherently errors and mistakes. If anyone tells you their data is perfect they are most likely fibbing, and also believe their armpits never smell.
The challenge of data accuracy was reinforced recently in two different blog posts where readers identified errors on maps that we posted. One was a map our data team created on “College Coaches Salaries” where there were geocoding errors, and the second was Steve Chilton’s OSM coverage map that had Monaco in place of Munich.
If you’ve spent a lot of time with geospatial data you’ll know these errors happen quite easily. Errors happen frequently with geocoding software, and it is easy to overlook a misplaced city name when going over hundreds of columns. I’ve been thinking about how we can introduce better quality assurance into the data we contribute and also help users of GeoCommons identify issues in their shared data.
For inspiration I looked into two existing projects: Wikipedia and Swivel. Wikipedia probably has the most advanced quality assurance mechanisms in place for a crowdsourced project, but it is focused on text. Swivel, on the other hand, deals directly with data, although not geospatial data.
One of the most useful approaches I’ve seen in Wikipedia is a common set of icons for labeling articles that have issues (no citation, too long, reads biased, needs verification, etc.). With the icons and text I can quickly see issues that exist with an article, which can help me gauge the extent to which I should trust the text. While the Wikipedia taxonomy is quite thorough it is geared around articles and not geospatial data.
One of the great things about data is that many organizations release it into the public domain, so copying data does not have the same issues that copying text has (plagiarism). This provides the opportunity to have data come directly from an “official source”. Swivel had the great idea of formalizing this by creating partnerships with organizations to share their data with the community as an “official source”. This again helps users decide on the level of confidence they have in a particular data set.
So my conclusion after spending some time looking at both is that a set of icons and labels for datasets, letting users know their level of vetting, could be useful when combined with clear labeling of a data set as “official source” or transcribed by someone else. Here are a few possible labels and icons for data.
Geocoding Error
Needs Citation
Data Needs Cleanup
Data has been QA’d by an Editor
Then there are the icons that Swivel has created for “official source” data managed by Swivel and “official source” data uploaded by the source organization.
These are the tags that seemed to be most relevant. Are there other tags that folks think would be useful, or does anyone see issues with these? If there is general consensus around labels and icons to tag the level of validation for a given data set, they could be used by anyone who has a need.
The other bit of this that I think is critical is creating a feedback loop to identify what the issues are with a particular data set. This opens the question of whether these should be georeferenced annotations indicating where on the map the error is, comments on the metadata page explaining the problem, or a combination of both. This requires a bit more engineering effort than the icons, but my first take is that a combination of the two could work well. Any other suggestions out there for providing better QA on geospatial data? Would love to hear them.
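To make the "combination of both" idea a little more concrete, here is a rough sketch of how labels and feedback might be stored together. This is purely illustrative and not how GeoCommons works: the class and field names are hypothetical, and the label set is just the four tags proposed above. A comment with no location would live on the metadata page, while one with coordinates could also be drawn on the map.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

class QALabel(Enum):
    """The four labels proposed in this post."""
    GEOCODING_ERROR = "Geocoding Error"
    NEEDS_CITATION = "Needs Citation"
    NEEDS_CLEANUP = "Data Needs Cleanup"
    EDITOR_REVIEWED = "Data has been QA'd by an Editor"

@dataclass
class Issue:
    label: QALabel
    comment: str
    # (latitude, longitude) when the problem is tied to a spot on the map;
    # None for a plain comment on the metadata page.
    location: Optional[Tuple[float, float]] = None

@dataclass
class DatasetQA:
    title: str
    issues: List[Issue] = field(default_factory=list)

    def report(self, label, comment, location=None):
        """Record one piece of feedback against the dataset."""
        self.issues.append(Issue(label, comment, location))

    def labels(self):
        """Distinct labels to show as icons next to the dataset."""
        return {issue.label for issue in self.issues}

    def georeferenced(self):
        """Issues that can be displayed as annotations on the map."""
        return [issue for issue in self.issues if issue.location is not None]

qa = DatasetQA("Example dataset")
qa.report(QALabel.NEEDS_CLEANUP, "City name looks wrong in one row")
qa.report(QALabel.GEOCODING_ERROR, "Point is in the wrong place", (40.0, -75.0))
print(sorted(label.value for label in qa.labels()))
print(len(qa.georeferenced()))
```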
























