
Open Data: Countdown to Open Data Day 3

This is day 3 of my countdown to Open Data Day. That sneaky Open Data Day is slithering and sneaking up on us like a greased snake on an ice rink. So far we’ve looked at data from the provincial government and the city. That leaves just one level of government: the federal government. I actually found that the feds had the best collection of data. Their site data.gc.ca has a huge number of data sets. What’s more, the government has announced that it will be adopting the open government platform developed jointly by India and the US: http://www.opengovtplatform.org/.

One of the really interesting things you can do with open data is to merge multiple data sets from different sources and pull out conclusions nobody has ever looked at before. That’s what I’ll attempt to do here.

Data.gc.ca has an amazing number of data sets available, so if you’re like me and just browsing for something fun to play with then you’re in for a bit of a challenge. I eventually found a couple of data sets related to farming in Canada which looked like they could be fun. The first was a set of data about farm incomes and net worths between 2001 and 2010. The second was a collection of data about yields of various crops in the same time frame.

I started off in Excel summarizing and linking these data sets. I was interested to see if there was a correlation between high grain yields per hectare and an increase in farm revenue. This would be a reasonable assumption, as getting more grain per hectare should allow you to sell more and earn more money. Using the power of Excel I merged and cut up the data sets to get this table:

Year | Farm Revenue | Yield Per Hectare | Production in tonnes
2001 | 183267       | 2200              | 5864900
2002 | 211191       | 1900              | 3522400
2003 | 194331       | 2600              | 6429600
2004 | 238055       | 3100              | 7571400
2005 | 218350       | 3200              | 8371400
2006 | 262838       | 2900              | 7503400
2007 | 300918       | 2600              | 6076100
2008 | 381597       | 3200              | 8736200
2009 | 381250       | 2800              | 7440700
2010 | 356636       | 3200              | 8201300
2011 | 480056       | 3300              | 8839600
Looking at this it isn’t apparent whether there is a link. We need a graph!

I threw it up against d3.js and produced some code which was very similar to my previous bar chart example in HTML 5 Data Visualizations – Part 5 – D3.js.
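For reference, here is a sketch of roughly what that chart code looks like. It uses the d3 v3-era API; the data array is truncated and the scale divisors are my own placeholders, not the exact values from the live demo:

    // Paired bars: yields in blue, revenues in orange.
    // The divisors (15 and 2000) just squash both series into a
    // 300px-tall SVG; they are placeholders, not meaningful units.
    var data = [
        {year: 2001, revenue: 183267, yield: 2200},
        {year: 2002, revenue: 211191, yield: 1900}
        // ... remaining rows from the table above
    ];

    var svg = d3.select("body").append("svg")
        .attr("width", 600)
        .attr("height", 300);

    svg.selectAll("rect.yield")
        .data(data)
      .enter().append("rect")
        .attr("class", "yield")
        .attr("x", function (d, i) { return i * 50; })
        .attr("y", function (d) { return 300 - d.yield / 15; })
        .attr("width", 20)
        .attr("height", function (d) { return d.yield / 15; })
        .attr("fill", "steelblue");

    svg.selectAll("rect.revenue")
        .data(data)
      .enter().append("rect")
        .attr("class", "revenue")
        .attr("x", function (d, i) { return i * 50 + 22; })
        .attr("y", function (d) { return 300 - d.revenue / 2000; })
        .attr("width", 20)
        .attr("height", function (d) { return d.revenue / 2000; })
        .attr("fill", "orange");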

Grain yields in blue, farm revenues in orange

I didn’t bother with any scales because it is immediately apparent that there is no obvious correlation. Huh. I would have thought the opposite.
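Eyeballing an unscaled chart can mislead, though, so if you want to put a number on it the Pearson correlation coefficient is the standard check. A quick sketch in plain JavaScript using the figures from the table above (my own addition, not part of the linked demo):

    // Pearson correlation: 1 = perfect positive linear relationship,
    // 0 = no linear relationship, -1 = perfect negative.
    function pearson(xs, ys) {
        var n = xs.length;
        var mx = xs.reduce(function (a, b) { return a + b; }, 0) / n;
        var my = ys.reduce(function (a, b) { return a + b; }, 0) / n;
        var num = 0, dx2 = 0, dy2 = 0;
        for (var i = 0; i < n; i++) {
            var dx = xs[i] - mx, dy = ys[i] - my;
            num += dx * dy;
            dx2 += dx * dx;
            dy2 += dy * dy;
        }
        return num / Math.sqrt(dx2 * dy2);
    }

    var yields = [2200, 1900, 2600, 3100, 3200, 2900,
                  2600, 3200, 2800, 3200, 3300];
    var revenues = [183267, 211191, 194331, 238055, 218350, 262838,
                    300918, 381597, 381250, 356636, 480056];
    console.log(pearson(yields, revenues));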

You can see a live demo and the code over at http://bl.ocks.org/stimms/5008627

Open Data: Countdown to Open Data Day 2

Open Data Day draws ever closer, like Alpha Centauri would if we lived in a closed universe during its contraction phase. Today we will be looking at some of the data the City of Calgary produces, in particular the geographic data about the city.

I should pause here and say that I really don’t know what I’m doing with geographic data. I am not a GIS developer, so there are very likely better ways to process this data and awesome refinements that I don’t know about. I can say that the process I followed here does work, so that’s a good chunk of the battle.

A common topic of conversation in Calgary is “Where do you live?”. The answer is typically the name of a community, to which I nod knowingly even though I have no idea which community is which. One of the data sets from the city is a map of the city divided into community boundaries. I wanted a quick way to look up where communities are. To start I downloaded the shape files, which came as a zip. Unzipping it got me:

  • CALGIS.ADMCOMMUNITYDISTRICT.dbf
  • CALGIS.ADMCOMMUNITYDISTRICT.prj
  • CALGIS.ADMCOMMUNITYDISTRICT.shp
  • CALGIS.ADMCOMMUNITYDISTRICT.shx

It is my understanding that these are ESRI files. I was most interested in the .shp file because I read that it could be transformed into a format known as GeoJSON, which can be read by D3.js. To do this I followed the instructions on Jim Vallandingham’s site. I used a tool called ogr2ogr:

ogr2ogr -f geoJSON output.json CALGIS.ADM_COMMUNITY_DISTRICT.shp

However, this didn’t work properly; when put into the web page it produced a giant mess which looked a lot like

Random Mess

I know a lot of people don’t like the layout of roads in Calgary but this seemed ridiculous.

I eventually found out that the shp file I had was in a different coordinate system from the one D3.js was expecting. I should really go into more detail about that, but not being a GIS guy I don’t understand it very well. Fortunately some nice people on StackOverflow came to my rescue and suggested that I instead use

ogr2ogr -f geoJSON output.json -t_srs "WGS84" CALGIS.ADM_COMMUNITY_DISTRICT.shp

The -t_srs flag sets the target spatial reference system, so this instructs ogr2ogr to reproject the output into World Geodetic System 1984 (WGS84), the longitude/latitude system D3.js expects.
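One quick way to check that the reprojection worked, sketched here with Node.js and assuming the output is a standard GeoJSON FeatureCollection (which is what ogr2ogr emits), is to look at the first coordinate. Calgary sits near longitude -114, latitude 51, so the values should be in that neighbourhood rather than the huge numbers a projected coordinate system produces:

    var fs = require("fs");
    var geo = JSON.parse(fs.readFileSync("output.json", "utf8"));

    // Polygon coordinates are nested arrays; drill down to the
    // first actual [longitude, latitude] pair.
    var first = geo.features[0].geometry.coordinates;
    while (Array.isArray(first[0])) {
        first = first[0];
    }
    console.log(first); // expect something like [-114.x, 51.x]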

Again leaning on work by Jim Vallandingham I used d3.js to build the map in an SVG.
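The core of that code looks roughly like the sketch below, using the d3 v3-era geo API. The scale, rotate, and translate values are placeholders, and the NAME property is my assumption about what the shapefile attributes convert to:

    var projection = d3.geo.mercator()
        .scale(30000)            // placeholder; found by trial and error
        .rotate([0, 0])          // placeholder
        .translate([400, 300]);  // placeholder

    var path = d3.geo.path().projection(projection);

    d3.json("output.json", function (error, communities) {
        d3.select("body").append("svg")
            .attr("width", 800)
            .attr("height", 600)
          .selectAll("path")
            .data(communities.features)
          .enter().append("path")
            .attr("d", path)
            .attr("fill", "#ccc")
            .attr("stroke", "#fff")
            // Assumed attribute name; the real property on the
            // converted shapefile may differ.
            .attr("data-name", function (d) { return d.properties.NAME; });
    });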

The most confusing part in there is the section which scales, rotates, and translates the map. If those values seem random it is because they are: I spent at least an hour twiddling with them to get them more or less correct. If you look at the final product you’ll notice it isn’t quite straight. I don’t care. Everything else is fairly easy to understand and should look a lot like the d3.js we’ve done before.

Coupled with a little bit of jQuery for selecting matching elements we can build this very simple map. It will take some time to load as the GeoJSON is about 3 MB. This could probably be reduced by simplifying the shape files and trimming the number of properties in the JSON. I also think this JSON is probably very compressible, so delivering it over a compressed stream would be more efficient.
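The jQuery part really is small. Here is a sketch of the lookup, assuming each path carries its community name in a data-name attribute as in the map sketch above, and a hypothetical #search text box:

    // Highlight communities whose name contains the typed text.
    $("#search").on("keyup", function () {
        var query = $(this).val().toUpperCase();
        $("path").attr("fill", "#ccc");
        if (query.length > 0) {
            $("path[data-name*='" + query + "']").attr("fill", "orange");
        }
    });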

The full code is available on github at https://github.com/stimms/VectorMapOfCalgary


Open Data: Countdown to Open Data Day

Only a few more days to go before Open Data Day is upon us. For each of the next few days I’m going to look at a set of data from one of the levels of government and try to get some utility out of it. Some of the data sets which come out of the government have no conceivable use to me. But that’s the glory of open data, somebody else will find these datasets to be more useful than a turbo-charged bread slicer.

Today I’m looking at some of the data the provincial government is making available through The Office of Statistics and Information. This office seems to be about the equivalent of StatsCan for Alberta. They have published a large number of data sets which have been divided into categories of interest such as “Science and Technology”, “Agriculture”, and “Construction”. Drilling into the data sets typically gets you a graph of the data and the data used to generate the graph. For instance, looking into Alberta Health statistics about infant mortality gets you to this page.

The Office of Statistics and Information, which I’ll call OSI for the sake of my fingers, seems to have totally missed the point of open data. They have presented data as well as interpretations of the data as open data. This is a cardinal sin, in my mind. Open data is not about giving people data you’ve already massaged in a CSV file. It is about giving people the raw source data so that they can draw their own conclusions. Basically, give people the tools to think for themselves; don’t do the thinking for them.

The source data they give doesn’t provide any advantage over the graph; in fact it is probably worse. What should have been given here is an anonymized list of all the births and deaths of infants in Alberta, broken down by date and by hospital. From that I can gather all sorts of other interesting statistics, such as the following (a sketch of how one might compute the first of them appears after the list):

  • Percentage of deaths at each hospital
  • Month of the year when there are the most births (always fun for making jokes about February 14th + 9 months)
  • The relative frequency of deaths in the winter compared with those in the summer
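As a sketch of what raw data makes possible, here is how the first of those might be computed, assuming hypothetical records of the form {hospital, born, died} where died is null for infants who survived (the field names are made up):

    // Percentage of infant deaths at each hospital from raw records.
    function deathRateByHospital(records) {
        var counts = {};
        records.forEach(function (r) {
            var c = counts[r.hospital] || { births: 0, deaths: 0 };
            c.births += 1;
            if (r.died !== null) {
                c.deaths += 1;
            }
            counts[r.hospital] = c;
        });
        Object.keys(counts).forEach(function (h) {
            var c = counts[h];
            console.log(h + ": " +
                (100 * c.deaths / c.births).toFixed(2) + "%");
        });
    }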

For this particular data set we see references to zones. What delineates these zones? I went on a quest to find out and eventually came across a map at the Alberta Health page. The map is, of course, a PDF. Without this map I would never have known that Tofield isn’t in the Edmonton zone while the much more distant Kapasiwin is. The reason for this is likely lost in the mists of government bureaucracy. So this brings me to complaint number two: don’t lock data into artificial containers. I should not have to go hunting around to find the definition of zones; they should either be linked off the data page or, better, just not used. Cities are a pretty good container for data of this sort: if the original data had been broken down by Calgary, Edmonton, Banff, … then its meaning would have been far more apparent.

Anyway, I promised I would do something with the data. I’m so annoyed by the OSI that this is just going to be a small demonstration. I took the numbers from the data set above and put them into the map, from which I painstakingly removed all the city names.

Infant mortality in Alberta

Obviously there are a million factors which determine infant mortality but all things being equal you should have your babies in Calgary. You should have them here anyway because Calgary has the highest concentration of awesome in the province. Proof? I live here.