2013-02-20

Open Data - Countdown to Open Data Day 2

Open Data Day draws ever closer, like Alpha Centauri would if we lived in a closed universe during its contraction phase. Today we will be looking at some of the data the City of Calgary produces, in particular the geographic data about the city.

I should pause here and say that I really don’t know what I’m doing with geographic data. I am not a GIS developer, so there are very likely better ways to process this data and awesome refinements that I don’t know about. I can say that the process I followed here does work, so that’s a good chunk of the battle.

A common topic of conversation in Calgary is “Where do you live?”. The answer is typically the name of a community, to which I nod knowingly even though I have no idea which community is which. One of the data sets from the city is a map of the city divided into community boundaries. I wanted a quick way to look up where communities are. To start I downloaded the shape files, which came as a zip. Unzipping these got me

  • CALGIS.ADM_COMMUNITY_DISTRICT.dbf
  • CALGIS.ADM_COMMUNITY_DISTRICT.prj
  • CALGIS.ADM_COMMUNITY_DISTRICT.shp
  • CALGIS.ADM_COMMUNITY_DISTRICT.shx

It is my understanding that these are ESRI files. I was most interested in the shp file because I read that it could be transformed into a format known as GeoJSON, which can be read by D3.js. To do this I followed the instructions on Jim Vallandingham’s site, using a tool called ogr2ogr

ogr2ogr -f GeoJSON output.json CALGIS.ADM_COMMUNITY_DISTRICT.shp

However this didn’t work properly; when put into the web page it produced a giant mess which looked a lot like

Random Mess

I know a lot of people don’t like the layout of roads in Calgary but this seemed ridiculous.

I eventually found out that the shp file I had was in a different coordinate system from what D3.js was expecting. I should really go into more detail about that but, not being a GIS guy, I don’t understand it very well. Fortunately some nice people on StackOverflow came to my rescue and suggested that I instead use

ogr2ogr -f GeoJSON -t_srs "WGS84" output.json CALGIS.ADM_COMMUNITY_DISTRICT.shp

The -t_srs flag instructs ogr2ogr to reproject the output into the World Geodetic System 1984 (WGS84) coordinate system, which is what D3.js expects.

Again leaning on work by Jim Vallandingham I used d3.js to build the map in an SVG.

The most confusing line in there is the section with scaling, rotating and translating the map. If these values seem random it is because they are. I spent at least an hour twiddling with them to get them more or less correct. If you look at the final product you’ll notice it isn’t quite straight. I don’t care. Everything else is fairly easy to understand and should look a lot like the d3.js we’ve done before.
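For the record, all that twiddling ends up in an SVG transform attribute on the group element holding the map. A minimal sketch of composing one (the helper name and the numbers here are hypothetical, not the values from the final map):

```javascript
// Hypothetical helper: build the transform attribute applied to the map group.
// The arguments are exactly the values you end up twiddling by hand.
function mapTransform(scale, rotateDegrees, translateX, translateY) {
  return "translate(" + translateX + "," + translateY + ")" +
         " scale(" + scale + ")" +
         " rotate(" + rotateDegrees + ")";
}

// e.g. svg.select("g.map").attr("transform", mapTransform(1200, -2, 150, 80));
console.log(mapTransform(1200, -2, 150, 80));
```

Pulling the numbers into one function at least keeps the guess-and-check confined to a single place.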

Coupled with a little bit of jQuery for selecting matching elements we can build this very simple map. It will take some time to load as the GeoJSON is 3 megabytes in size. This could probably be reduced by simplifying the shape files and reducing the number of properties in the JSON. The JSON is also probably very compressible, so delivering it over a compressed stream would be more efficient.

The full code is available on GitHub at https://github.com/stimms/VectorMapOfCalgary

2013-02-19

Open Data - Countdown to Open Data Day

Only a few more days to go before Open Data Day is upon us. For each of the next few days I’m going to look at a set of data from one of the levels of government and try to get some utility out of it. Some of the data sets which come out of the government have no conceivable use to me. But that’s the glory of open data: somebody else will find these datasets more useful than a turbo-charged bread slicer.

Today I’m looking at some of the data the provincial government is making available through The Office of Statistics and Information. This office seems to be about the equivalent of StatsCan for Alberta. They have published a large number of data sets which have been divided into categories of interest such as “Science and Technology”, “Agriculture” and “Construction”. Drilling into the data sets typically gets you a graph of the data and the data used to generate the graph. For instance, looking into Alberta Health statistics about infant mortality gets you to this page.

The Office of Statistics and Information, which I’ll call OSI for the sake of my fingers, seems to have totally missed the point of open data. They have presented data as well as interpretation of the data as open data. This is a cardinal sin, in my mind. Open data is not about giving people data you’ve already massaged in a CSV file. It is about giving people the raw, source data so that they can draw their own conclusions. Basically, give people the tools to think for themselves; don’t do the thinking for them.

The source data they give doesn’t provide any advantage over the graph, in fact it is probably worse. What should have been given here is an anonymized list of all the births and deaths of infants in Alberta broken down by date and by hospital. From that I can gather all sorts of other interesting data such as

  • Percentage of deaths at each hospital
  • Month of the year when there are the most births (always fun for making jokes about February 14th + 9 months)
  • The relative frequency of deaths in the winter compared with those in the summer

For this particular data set we see reference to zones. What delineates these zones? I went on a quest to find out and eventually came across a map at the Alberta Health page. The map is, of course, a PDF. Without this map I would never have known that Tofield isn’t in the Edmonton zone while the much more distant Kapasiwin is. The reason for this is likely lost in the mists of government bureaucracy. So this brings me to complaint number two: don’t lock data into artificial containers. I should not have to go hunting around to find the definition of zones; they should either be linked off the data page or, better, just not used. Cities are a pretty good container for data of this sort: if the original data had been set up for Calgary, Edmonton, Banff, … then its meaning would have been far more apparent.

Anyway, I promised I would do something with the data. I’m so annoyed by the OSI that this is just going to be a small demonstration. I took the numbers from the data set above and put them into the map, from which I painstakingly removed all the city names.

Infant mortality in Alberta

Obviously there are a million factors which determine infant mortality but all things being equal you should have your babies in Calgary. You should have them here anyway because Calgary has the highest concentration of awesome in the province. Proof? I live here.

2013-02-18

Data Visualization - A Misleading Visualization

There is a saying which goes something like “you can make up statistics to prove anything, 84% of people know that”. The assertion is that nobody checks the sources of statistics which is more or less accurate. The lack of fact checking goes double for the recent surge of infographics on the web. I saw one show up on twitter today which I thought was particularly damning in its misrepresentation of statistics.

A poor visualization

What’s wrong with this? Look at the size of those two circles. The one on the left is shockingly larger than the one on the right. This is done very much on purpose to shock people into thinking that the government is burning through money, that government workers have received a huge salary increase in comparison with the private sector. However the difference isn’t that huge. The ratio between the two should be about 2.38 but if we look at the size of the circles the ratio looks to be closer to 7 or 8.

Small circles inside the large one

A common mistake made with circles is to double the diameter to represent a doubling in size. Unfortunately, this increases the area by a factor of 4, not 2. In this case the ratio is more than doubled so this isn’t the common mistake but a purposeful misrepresentation.
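The arithmetic is easy to check for yourself. A quick sketch (the function name is mine):

```javascript
// Ratio of the areas of two circles given their diameters.
// Area is pi * r^2, so scaling the diameter by k scales the area by k^2.
function areaRatio(diameterA, diameterB) {
  var rA = diameterA / 2;
  var rB = diameterB / 2;
  return (Math.PI * rA * rA) / (Math.PI * rB * rB);
}

console.log(areaRatio(2, 1)); // doubling the diameter quadruples the area: prints 4
```

To honestly represent a 2.38 ratio with circles, the diameters should differ by the square root of 2.38, about 1.54.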

More than a 2.38 ratio

The moral of the story? While data visualizations can tell a story about data, you, as the consumer of the visualization, need to pay attention to the underlying data and not just the pretty picture.

2013-02-15

Open Data 101

One of the things I’m really enthusiastic about is open data. If you haven’t heard of open data, the idea is that governments have a lot of data at their disposal. They gather this data as a normal part of doing business. If you think about your city, they gather data about traffic patterns so they can set up traffic light patterns and decide which interchanges to expand first. They gather data about property values so they know what tax rates should be. They gather demographics about neighbourhoods to decide where to put recreation centers. Cities also have at their disposal lists of all the street names and all the companies registered in the city. The list of data goes on and on.

Other levels of governments are equally well set up with data. You simply cannot run a government, or really any large company, without a lot of data. Typically governments sit on this data, they hoard it and gloat over it in the high towers of their data castles.

The City of Calgary’s data castle. Huge waste of taxpayers’ money if you ask me.

Fortunately this is changing. Governments are starting to open this data up and make it available to the general public. This is fantastic because it allows those of us with some data analysis chops to dig in and find all sorts of correlations which governments might not have noticed. Many eyes can find things which government workers might have missed. Governments are also amazingly slow to react to new technologies, so it isn’t very likely that your government is going to even think about producing a mobile application, much less create one. However private individuals or companies may well see profit or use in creating applications. With open data they can go ahead and do it.

The key to open data is that governments give it away for free and without strings. Some governments are reluctant to give the data away without strings attached. They consider that they have spent a lot of money to create the data so they should be compensated for it. What they’re missing is that the money they’re spending is our money. We paid for the data so we should be entitled to use it in whatever way we see fit. The applications and tools which the data savvy are creating are not being created by governments. Everybody is likely to benefit from these applications, so that is another way in which governments benefit from open data.

The city in which I live, the City of Calgary, has a small collection of open data available on their website. Unfortunately, they’re still behind the times. The data is protected by a rather draconian user agreement (Edit: Walter was kind enough to point out that the license has been vastly improved; I no longer have any real complaints about it) and the city provides no API access to the data. I am hopeful that the city will catch on soon and update the data they provide.

I’m talking today about open data because February the 23rd is Open Data Day. I am going to spend some time writing something using either Calgary’s open data or something from the provincial or federal governments. I’m not sure what I’ll make so please feel free to make some suggestions. Calgary has a few transit applications already so it won’t be one of those. If you’re interested in joining me to create something then drop me a line. Open data is only a success if people like you and me use it. So let’s get using it!

2013-02-14

HTML 5 Visualizations - Talk Notes

If you came to my talk today then thanks! If you didn’t then you should know that I’m writing down your name. What am I going to do with the list of names I build? Probably I’ll sell it to telemarketers or something.

PowerPoint slides: HTML5 data visualizations (don’t bother, there are like 3 slides)

Code: https://github.com/stimms/HTML5Visualizations

The presentation is based on a number of blog entries written earlier this year:

HTML5 Data Visualizations – Part 1 – SVG vs. Canvas

HTML5 Data Visualizations – Part 2 – An Introduction to SVG

HTML5 Data Visualizations – Part 3 – Getting Started with Raphaël

HTML5 Data Visualizations – Part 4 – Creating a component with Raphaël and TypeScript

HTML 5 Data Visualizations – Part 5 – D3.js

HTML 5 Data Visualizations – Part 6 – Visual Jazz

2013-02-14

Presentation Today!

Today I’m doing a talk at the Calgary .NET group about HTML 5 data visualizations. If you’re interested in learning a bit about some of the cool, interactive graphics which HTML 5 enables in the browser then I encourage you to come out. You will go away knowing how to build a simple bar chart in a handful of lines of JavaScript and you may even learn something about TypeScript.

The event starts at high noon in downtown Calgary, 800 6th Ave SW (+15 Level across from Spice Cafe).

Bring your lunch and come out, you’ll be done in plenty of time to make it back to work for 1. The slides and demos will be posted here once my presentation has started.

2013-02-13

Debugging Android App from OSX

I just got a Nexus 7 for my birthday and thought I would try deploying to it. I really have no idea about building for Android so there were a couple of stumbling blocks for me I thought I would write down for future reference.

  1. The Nexus 7 and, I understand, Android devices in general after 4.2 don’t have a development option on the menu. You have to go to settings > about and then tap on the build number 7 times to enable it. Hilarious. I’m delighted we basically have a Konami code for getting to options.

  2. By default USB debugging is turned off. You need to turn it on from the developer menu for the device to actually be found by the Android SDK. You can check your devices by running

adb devices

adb is found in the platform-tools directory of the SDK.

2013-02-12

Stop With the Error Codes, Already

I don’t like error codes. I suppose that back in the day when computers were young and had very limited memory there was a purpose to error codes. But last I checked, everybody and their mother had enough storage available to show people a message in their language of choosing. I’m put in mind of this today by a random alphanumeric error showing up on Windows Phone. I didn’t see it myself and, honestly, I’m so done with Windows Phone that I wouldn’t care if I did.

Why are we showing errors like this to users? We try to abstract the inner functioning of our applications from users so they don’t know what technologies or tools were used to create them. Users don’t care that you used a really novel way of building a dependency injection container or that your development workflow was centered around branches and not feature flags. Nope, they care if things work and if they are easy to use. So look deep into your developer brains and think: which of these is better?

805a0193

or

An error has occurred: the server is currently unresponsive – we’re working on a fix and will be back shortly

Gosh, I sure like that second one. Even more than that error I would like to see something which will tell me how I can fix the problem, if it is indeed something I can fix. As an example, I have an application which reads Excel files and loads them into a database. I would say that 30% of the time when I run it there is a failure because I’ve still got the Excel file open. Instead of crashing or showing an unintelligible error message I show

Unable to read Excel file. Typically this is because the file is already open. Close Excel and click to retry.

This error is precise, it gives a tip on how to fix it and a shortcut to the action the user should perform after they have solved the problem. We should be writing more errors like this. It only takes an extra 5 minutes to code up but can save people a lot in the long run.
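One cheap way to get there is a lookup from internal code to a human-readable message, with an honest fallback for codes you haven’t mapped yet. A sketch (the codes and messages here are made up for illustration):

```javascript
// Hypothetical mapping of internal error codes to messages a human can act on.
var errorMessages = {
  "805a0193": "An error has occurred: the server is currently unresponsive. " +
              "We're working on a fix and will be back shortly.",
  "excel_file_locked": "Unable to read Excel file. Typically this is because " +
                       "the file is already open. Close Excel and click to retry."
};

function friendlyError(code) {
  var message = errorMessages[String(code).toLowerCase()];
  // Fall back to something honest rather than showing the raw code alone.
  return message || "Something went wrong (code " + code + "). Please try again.";
}

console.log(friendlyError("805A0193"));
```

The fallback still includes the code for support staff, but wraps it in enough context that the user isn’t left staring at hex.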

2013-02-11

There is No Feature Flag Conspiracy

I know what they say: that if there were a conspiracy we would deny its existence. But honestly, those of us who support feature flags didn’t even know that we were supposed to be conspiring. In any case, one of those smart guys I’ve talked about in the past, Amir Barylko, just wrote a blog post in which he derides the idea of feature flags and supports branches instead.

The definition which Amir gives of feature flags doesn’t sit entirely well with me. A feature flag is a branch in the flow of the application which disables a chunk of functionality for some or all users. These flow branches can be compile-time switches using ifdefs, if statements in the code, or even a strategy pattern where different strategies are plugged in.

Let’s take a step back for a moment and talk about why we branch our code. I think there are a number of reasons for and applications of branching.

  1. You want to develop a feature which will take some time and will not be immediately visible to users. So perhaps you want to add a function which e-mails users when some long running action they have started completes. This will take more than the 3 hours you have left in the day so you branch and, when the feature is complete, you merge it back into trunk and all is well. These feature branches may live for hours, days, weeks or even months and may be worked upon by several people.

  2. You’re supporting multiple versions of the application. I know this is hard to remember in the reality of software as a service but there was once a time when you would deliver a product and then supply updates and patches for a time while having already started developing the next version. You don’t want users of the previous version to get the full update to the next version without paying but at the same time you don’t want them to suffer a bug which is fixed in the next version. In this case you maintain a branch for each release in which you can add fixes.

  3. When you’re working on a feature and something more important comes along so you shove your work in a branch and come back to it later.

  4. Quality branches, used when you promote features from dev to test to production. This ensures that only features which are ready for production make it there, and that they pass through test on the way.

  5. To support multiple parallel developments against different technologies. Perhaps your product has a version for Windows and a version for UNIX. Between the two of them there likely exists a shared kernel but you keep different branches for each platform and merge into them from the shared kernel.

Huh, I was sort of expecting that to be a longer list. I may have missed something in here but I don’t think so. I don’t believe that either branching or feature flags is the magic pill which solves each one of these use cases optimally.

Amir’s argument boils down to: git is really good at merging and feature flags add a lot of overhead without offering anything that branching doesn’t. Well let’s look at each of the cases above and see if that holds true.

Feature Branches

Feature branches keep new functionality isolated from users until it is ready. I don’t like the idea of branching for features right off the bat. It gives the idea that the feature is a big monolithic piece of code which needs to be completed before it is ready for users. Why is that dangerous? Because developing functionality in isolation leads to functionality which users don’t really want or which doesn’t work in the way the users envisioned. Feature flags are a clear winner here in my mind. Feature flags allow developers to turn features on at runtime to test out new functionality on small sets of users or try them out at certain times of day.

In many large systems the testing environment is different enough from production that it is almost impossible to know what will happen when a feature is turned on. The production environment can be too expensive to reproduce or too diverse. In these cases you turn the features on for a few people and see what happens. It sounds crazy but there legitimately are some things you can’t test outside of production. There are also some things you want to test in production. If you’re scientific about making changes to your application then you want to do A/B testing against a real user base.

Branching for features delays integration with mainline. It means that you’re purposefully adding uncertainty about what will happen when the feature is merged back in. In the comments to another post on this subject, Dylan Smith’s, Amir suggests that the delayed integration problem is a myth. Each feature branch is responsible for merging from mainline, he claims, so in fact integration is happening all the time. That’s not true. Integration is happening between mainline and each feature branch but not between different feature branches.

If features are very small and short lived then branching remains an option. In my mind if the development of a feature is expected to take more than 24 hours then a branch is not the right approach.

Branching to Support Multiple Releases

I think this is a totally legitimate use of branching and a terrible place to use feature flags. The reason is that the changes to the maintenance branch are always enabled. You would never want to put a change into a maintenance branch which wouldn’t be enabled right away. Source control systems like git are magnificent at merging between code bases. I remember merging between branches in ClearCase and in Subversion. These legacy source control systems were not good at this use case – it was almost always easier to just manually make the changes in two places. It is really difficult to scale beyond two releases. When I was working with ClearCase we supported about 10 simultaneous releases. It was not fun.

Branching to Shelve Work

Again, legitimate. When something more important comes along you may not have time to finish what you’re working on, which could leave the code base unable to compile. Committing this back into mainline would be disastrous (for a pretty benign definition of disaster, I admit). These branches are fine in my mind because they’re really short lived. Remember that: branches should be short lived.

Quality Branches

I don’t really think these differ much from feature branches. It is just a mechanism for promoting code between environments. Feature flags are a better choice here as they give better granularity for promotion and avoid the nightmare of trying to isolate, at merge time, what needs to be promoted. Merging upstream is a difficult problem. I’ve seen it used before in environments which strongly regulate what goes into production. We had terrible issues with merging and breaking upstream code. It was especially problematic as this was many years ago, before automated testing had really caught on. We did a lot of “emergency” fixes.

Supporting Multi Environment Development

Yuck. This is a totally illegitimate place to use branching. Different code for different environments should be isolated into different projects in mainline. If you create a branch for each one you’re buying a huge number of branches. Remember when I said I worked at a place where we supported 10 releases? We also supported 31 platforms. 31. Heck, we supported 4 different builds on Solaris: x86, x86_64, sparc and sparc64. Then we brought on a new compiler for Solaris which created binaries which were not backwards compatible. That added 3 more platforms, just for Solaris. I cannot even imagine how much time would have been wasted merging between branches if we had a different branch for each. It would have taken 90% of our development cycle just to push stuff to the various branches.

Arguments Against Feature Flags

A common argument I hear is that feature flags need to be toggled in a lot of places. If a change happens in the UI and the back end then that’s at least 2 places you need to put in the flag. This is a reasonable concern. The beauty of feature flags is that once the feature is fully live the flags can be removed. Centralizing the implementation is also key: testing whether a flag is enabled should be isolated to a single piece of code. I would also recommend leaning on the compiler a bit. Instead of using strings to denote the name of a flag, use an enum. That way it is easy to find all references to the value and quickly jump to the feature flags used in code.
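A minimal sketch of that idea, assuming JavaScript rather than a language with real enums (the flag names and storage here are hypothetical):

```javascript
// Central registry of feature flags. Named constants instead of bare strings
// make it easy to find every reference when a flag is eventually removed.
var Feature = Object.freeze({
  EmailNotifications: "EmailNotifications",
  NewDashboard: "NewDashboard"
});

// The one and only place flag state lives. In a real system this might
// consult configuration or a per-user rollout list instead of a literal.
var enabledFlags = { EmailNotifications: true };

// The single place where flags are tested.
function isEnabled(feature) {
  return enabledFlags[feature] === true;
}

console.log(isEnabled(Feature.EmailNotifications)); // prints true
console.log(isEnabled(Feature.NewDashboard));       // prints false
```

When the feature goes fully live you delete the flag from both objects and the compiler (or a grep for `Feature.`) shows you every call site to clean up.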

I really like feature flags for the flexibility they offer me over when I deploy features and for whom they are turned on. They’re not good in every situation and, as with all things, you have to think before you use them. There is a very real possibility that feature flags will grow out of control and turn your code into spaghetti. These concerns should be addressed the same way you would address any code quality concern: with thought and refactoring.

2013-02-08

Recreating Visualizations - CodeEval

I come across great visualizations every day. Every time I see one I now start thinking about how I could recreate it using SVG, or even how I could improve it. The Recreating Visualizations series of blog articles is going to explore some of these.


There was an interesting little article on reddit a few days back about some of the most popular programming languages of 2012. The results are a bit questionable but I did like the visualization they used. I say that the results are questionable because CodeEval are only looking at their own site’s results. They failed to state that clearly. To me the gold standard of programming language usage statistics is TIOBE. They publish their methodology openly and I cannot find fault with it. However their visualizations are not very attractive. Let’s see what happens when we combine the great visualizations of CodeEval and the statistics of TIOBE using d3.js.

A common mistake with bubble graphs of this sort is the scale of the bubbles. When building the graph we supply a radius to draw each circle, but if we scale the radius directly by the data value then we end up with misleading circles, because the area of a circle is pi * r^2. We need to adjust for this by taking the square root of the value as the radius.
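The adjustment itself is tiny. A sketch of an area-true radius calculation (the helper name is mine; d3 also ships a square-root scale that does the same job):

```javascript
// Radius such that circle AREA, not radius, is proportional to the value.
function radiusFor(value, maxValue, maxRadius) {
  return Math.sqrt(value / maxValue) * maxRadius;
}

// A language with four times the share gets twice the radius,
// which is four times the area.
console.log(radiusFor(4, 4, 100)); // prints 100
console.log(radiusFor(1, 4, 100)); // prints 50
```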

To start we build the data by copying and pasting from TIOBE

If you’ve read my multi-part series on introducing data visualizations for HTML5 then this should all look pretty familiar. If you haven’t read it then you should! I worked hard on it.

Initial Bubbles

Here we are setting up a new entry in our Graph module. You’ll notice on line 15 I set up the square root scale and set it to fill the entire width of the SVG. This is a bit of a naive approach and we’ll refine it later. We also placed each bubble in a row so none of them would overlap. It is going to look like

Bubbles!

Heck, already that’s kind of nifty. If we compare it with the CodeEval one, though, it doesn’t have the same sort of cool overlaps and we’re missing labels. The labels are easy so let’s start with them.

Labels

In a previous post I mentioned that it was difficult to center strings in an SVG. This, as it turns out, isn’t true! You can make use of an attribute called text-anchor which sets where the anchor point is for a block of text. In our case we want to set its value to “middle”, which means that whatever x value we give is treated as the center of the string.

This will add the name of the language to all our bubbles. You’ll notice that for the x and y values I’ve taken the values from the data array. In a step you’re about to see we calculate the values for the radius and coordinates for each bubble. Saving it back into the data array means we only ever have to calculate it once.
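As raw markup, what we’re generating for each bubble boils down to something like this (a sketch building the element as a string rather than through d3’s attr calls):

```javascript
// Build a centered SVG text element. With text-anchor="middle" the x value
// is treated as the horizontal center of the string rather than its left edge.
function centeredLabel(x, y, name) {
  return '<text x="' + x + '" y="' + y + '" text-anchor="middle">' +
         name + '</text>';
}

console.log(centeredLabel(100, 50, "C#"));
```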

Calculating Bubbles

When CodeEval made their graphic they probably did so using Photoshop or some other piece of graphical design software. This is typical of a lot of the visualizations we see on the web. As with most computery things, if you have to do something once you can do it by hand, but if you do it more than once, script it. Besides, we don’t have the benefit of graphical design software when we’re drawing an SVG, so we need to do a little math to get our bubbles to fit in a nice way.

I started by thinking that I would like the majority of the image to be taken up by bubbles. If our bubbles were non-overlapping rectangles then we would end up with an algorithm which looked a lot like the NP-hard optimization version of the knapsack problem. Fortunately, we don’t need to do perfect packing so we can use linear approximations to come up with reasonable values.

I figure out the total area of the SVG and then use it to manipulate the scale. The magic number there on line 2 was found by doing a bit of guess and check. Anything between 1.5 and 2 seemed to create a reasonable fill. Now we need to figure out the location of the bubbles, which is a bit harder. There are a couple of strategies which can be used; I’ve opted to go with the simplest here in the interests of having something more to blog about later.

We start by pretending we have a bubble in the middle of the canvas. This pretend bubble isn’t drawn; it just stops us from having a boring bubble dead in the center of the canvas every time. From it we can figure out the placement of the next bubble. We would like the next bubble to have a little bit of overlap with the current bubble but not too much. If we add some padding to the current bubble then we get a new circle of possible locations for the center of the next bubble.

padding

Next we pick a random X value somewhere within the radius of the padding circle

calculation1

Now we can calculate the Y value as we know the radius of the outer circle. You will need to use TRIANGLES to do this. Well one triangle, but that doesn’t invalidate the point that your high school math is actually useful.

calculation2

That’s the center of your new bubble

Bubbles!

I randomly moved this point to different quadrants so we didn’t always have a bubble attached to another in the top right.

The drawback with this strategy is that you may have bubbles which end up off screen. I solved this by invalidating positions which were off screen and calculating a new random point for them.
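Putting the geometry above into code looks roughly like this. This is my reconstruction of the approach, not the original source; the exact padding rule and the quadrant flip are guesses:

```javascript
// Place the centre of the next bubble on a circle of radius
// (current radius + padding) around the current bubble, so the new
// bubble ends up overlapping the old one slightly.
function nextBubbleCentre(current, newRadius, padding, random) {
  var d = current.r + padding;          // distance between the two centres
  var dx = (random() * 2 - 1) * d;      // random x offset within that circle
  var dy = Math.sqrt(d * d - dx * dx);  // solve dx^2 + dy^2 = d^2 (the triangle)
  if (random() < 0.5) dy = -dy;         // flip into a random vertical half
  return { x: current.x + dx, y: current.y + dy, r: newRadius };
}

var next = nextBubbleCentre({ x: 0, y: 0, r: 10 }, 5, 2, Math.random);
// The new centre always lands exactly on the padding circle:
console.log(Math.round(Math.sqrt(next.x * next.x + next.y * next.y))); // prints 12
```

Passing the random source in as an argument makes the placement easy to test with a deterministic stand-in.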

This fits into our bubble calculations like so

Putting it Together

Now we have a way to build the locations of each bubble we can go ahead and combine it with the actual bubble drawing we did earlier.

This gives us something which looks like

Screen Shot 2013-02-08 at 9.42.31 AM

I like this a lot! However, because our bubbles only check for overlap against the previous bubble, we do sometimes end up with a mess like

Screen Shot 2013-02-08 at 10.06.17 AM

We’ll look at some ways to deal with this in an upcoming post. You can get the full code for this over at github.