Simon Online

2013-02-22

Open Data Day - The Final Day

I have been talking a lot about open data this week as we draw closer to Open Data Day. Tomorrow I’ll be at the Open Data Day event in Calgary for all the fun and games one might expect from an Open Data Day. In my last post before the event I wanted to talk about a different kind of open data. I started this series by defining open data within the context of government data. There is a great deal of data out there which is paid for by the government and taxpayers but is held secret. I am speaking of the research from universities.

The product of universities is research papers. These papers should be published openly for anybody to read and make use of. Instead they are locked away behind paywalls in expensive journals. The price of these papers is absolutely stunning. For instance the quite interesting paper “Isomorphism Types of Infinite Symmetric Graphs”, which was published in 1972, is $34. “Investigating Concurrency in Online Auctions Through Visualization” is $14. This sort of information is rightfully ours as the taxpayers who funded the university. It was in an attempt to free this data that Aaron Swartz was arrested and hounded to death. He downloaded an, admittedly large, number of papers and republished them.

A fallen hero

There should have been no need for this tragedy. The system which allows this practice to continue should be changed; it should be torn down and destroyed. Peer review is an important part of the scientific process and it is pure hubris to assume that all research is done in universities which are able to afford licenses to journals. Who knows how many there are out there like Srinivasa Ramanujan who can benefit from open data.

While I abhor the secrecy of journals there is a greater part to this story. The research behind these papers is rarely opened up to review. We see only the summarized results and perhaps a mention of the methodology. Requiring that the actual raw statistical and log data be opened as part of the process of peer review would help alleviate fraudulent research and accelerate the application of new discoveries. Push your politicians to link university funding to openness. It is an idea whose time has come.

2013-02-21

Open Data - Countdown to Open Data Day 3

This is day 3 of my countdown to Open Data Day. That sneaky Open Data Day is slithering and sneaking up on us like a greased snake on an ice rink. So far we’ve looked at data from the provincial government and the city. That leaves us just one level of government: the federal government. I actually found that the feds had the best collection of data. Their site data.gc.ca has a huge number of data sets. What’s more, the government has announced that it will be adopting the same open government system which has been developed jointly by India and the US: http://www.opengovtplatform.org/.

One of the really interesting things you can do with open data is to merge multiple data sets from different sources and pull out conclusions nobody has ever looked at before. That’s what I’ll attempt to do here.

Data.gc.ca has an amazing number of data sets available, so if you’re like me and just browsing for something fun to play with then you’re in for a bit of a challenge. I eventually found a couple of data sets related to farming in Canada which looked like they could be fun. The first was a set of data about farm incomes and net worths between 2001 and 2010. The second was a collection of data about yields of various crops in the same time frame.

I started off in Excel summarizing and linking these data sets. I was interested to see if there was a correlation between high grain yields per hectare and an increase in farm revenue. This would be a reasonable assumption, as getting more grain per hectare should allow you to sell more and earn more money. Using the power of Excel I merged and cut up the data sets to get this table:

Year   Farm Revenue   Yield per Hectare   Production in tonnes
2001   183267         2200                5864900
2002   211191         1900                3522400
2003   194331         2600                6429600
2004   238055         3100                7571400
2005   218350         3200                8371400
2006   262838         2900                7503400
2007   300918         2600                6076100
2008   381597         3200                8736200
2009   381250         2800                7440700
2010   356636         3200                8201300
2011   480056         3300                8839600
Looking at this it isn’t apparent if there is a link. We need a graph!

I threw it up against d3.js and produced some code which was very similar to my previous bar chart example in HTML 5 Data Visualizations - Part 5 - D3.js.
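
The post doesn’t reproduce that code, but it boils down to a few lines of d3. Here is a minimal sketch along those lines, assuming the merged table has been copied into a JavaScript array; the divisors that squeeze the values onto the screen are arbitrary stand-ins for the real scales I skipped.

// Minimal d3 sketch: grain yields in blue, farm revenues in orange.
// The data mirrors the merged table above.
var data = [
  { year: 2001, revenue: 183267, yield: 2200 },
  { year: 2002, revenue: 211191, yield: 1900 },
  // ... remaining years from the table ...
  { year: 2011, revenue: 480056, yield: 3300 }
];

var svg = d3.select("body").append("svg")
    .attr("width", 500)
    .attr("height", 300);

// One blue bar per year for yield.
svg.selectAll("rect.yield").data(data).enter().append("rect")
    .attr("class", "yield")
    .attr("x", function (d, i) { return i * 40; })
    .attr("width", 15)
    .attr("y", function (d) { return 300 - d.yield / 20; })
    .attr("height", function (d) { return d.yield / 20; })
    .attr("fill", "steelblue");

// One orange bar per year for revenue, drawn beside the blue one.
svg.selectAll("rect.revenue").data(data).enter().append("rect")
    .attr("class", "revenue")
    .attr("x", function (d, i) { return i * 40 + 17; })
    .attr("width", 15)
    .attr("y", function (d) { return 300 - d.revenue / 2000; })
    .attr("height", function (d) { return d.revenue / 2000; })
    .attr("fill", "orange");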

Grain yields in blue, farm revenues in orange

I didn’t bother with any scales because it is immediately apparent that there does not seem to be any correlation. Huh. I would have thought the opposite.
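
Eyeballing a chart is a blunt instrument, though; a Pearson correlation coefficient puts an actual number on it. Here is a quick sketch, not part of the original demo, run over the columns of the table above:

// Pearson correlation between the yield and revenue columns.
// r near 0 means no linear relationship; near 1, a strong one.
var yields   = [2200, 1900, 2600, 3100, 3200, 2900, 2600, 3200, 2800, 3200, 3300];
var revenues = [183267, 211191, 194331, 238055, 218350, 262838,
                300918, 381597, 381250, 356636, 480056];

function pearson(xs, ys) {
  var n = xs.length;
  var meanX = xs.reduce(function (a, b) { return a + b; }, 0) / n;
  var meanY = ys.reduce(function (a, b) { return a + b; }, 0) / n;
  var num = 0, dx2 = 0, dy2 = 0;
  for (var i = 0; i < n; i++) {
    var dx = xs[i] - meanX, dy = ys[i] - meanY;
    num += dx * dy;
    dx2 += dx * dx;
    dy2 += dy * dy;
  }
  return num / Math.sqrt(dx2 * dy2);
}

console.log(pearson(yields, revenues).toFixed(2));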

You can see a live demo and the code over at http://bl.ocks.org/stimms/5008627

2013-02-20

Open Data - Countdown to Open Data Day 2

Open Data Day draws ever closer, like Alpha Centauri would if we lived in a closed universe during its contraction phase. Today we will be looking at some of the data the City of Calgary produces, in particular the geographic data about the city.

I should pause here and say that I really don’t know what I’m doing with geographic data. I am not a GIS developer, so there are very likely better ways to process this data and awesome refinements that I don’t know about. I can say that the process I followed here does work, so that’s a good chunk of the battle.

A common topic of conversation in Calgary is “Where do you live?”. The answer is typically the name of a community, to which I nod knowingly even though I have no idea which community is which. One of the data sets from the city is a map of the city divided into community boundaries. I wanted a quick way to look up where communities are. To start I downloaded the shape files, which came as a zip. Unzipping these got me

  • CALGIS.ADM_COMMUNITY_DISTRICT.dbf
  • CALGIS.ADM_COMMUNITY_DISTRICT.prj
  • CALGIS.ADM_COMMUNITY_DISTRICT.shp
  • CALGIS.ADM_COMMUNITY_DISTRICT.shx

It is my understanding that these are ESRI files. I was most interested in the shp file because I read that it could be transformed into a format known as GeoJSON, which can be read by D3.js. To do this I followed the instructions on Jim Vallandingham’s site. I used a tool called ogr2ogr:

ogr2ogr -f GeoJSON output.json CALGIS.ADM_COMMUNITY_DISTRICT.shp

However this didn’t work properly and, when put into the web page, produced a giant mess which looked a lot like

Random Mess

I know a lot of people don’t like the layout of roads in Calgary but this seemed ridiculous.

I eventually found out that the shp file I had was in a different coordinate system from what D3.js was expecting. I should really go into more detail about that, but not being a GIS guy I don’t understand it very well. Fortunately some nice people on StackOverflow came to my rescue and suggested that I instead use

ogr2ogr -f GeoJSON output.json -t_srs "WGS84" CALGIS.ADM_COMMUNITY_DISTRICT.shp

The -t_srs flag sets the target spatial reference system: it tells ogr2ogr to reproject the output into World Geodetic System 1984 (WGS84) coordinates, which is what D3.js expects.

Again leaning on work by Jim Vallandingham I used d3.js to build the map in an SVG.
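
The code boils down to something along these lines. This is a sketch rather than the exact code; the projection numbers and the NAME property are stand-ins for whatever the real shapefile attributes turned out to be.

// Load the reprojected GeoJSON and draw one SVG path per community.
// The scale/rotate/translate values here are placeholders; finding
// workable ones was the fiddly part described below.
var svg = d3.select("body").append("svg")
    .attr("width", 600)
    .attr("height", 600);

var projection = d3.geo.mercator()
    .scale(40000)                 // zoom way in on a single city
    .rotate([0, 0, 0])            // nudge the map toward straight
    .center([-114.06, 51.05])     // roughly downtown Calgary
    .translate([300, 300]);       // middle of the SVG

var path = d3.geo.path().projection(projection);

d3.json("output.json", function (error, communities) {
  if (error) { return console.error(error); }
  svg.selectAll("path")
      .data(communities.features)
      .enter().append("path")
      .attr("d", path)
      .attr("class", function (d) { return d.properties.NAME; });
});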

The most confusing part in there is the section with scaling, rotating and translating the map. If these values seem random it is because they are. I spent at least an hour twiddling with them to get them more or less correct. If you look at the final product you’ll notice it isn’t quite straight. I don’t care. Everything else is fairly easy to understand and should look a lot like the d3.js we’ve done before.

Coupled with a little bit of jQuery for selecting matching elements, we can build this very simple map. It will take some time to load as the GeoJSON is 3 MB in size. This can probably be reduced by simplifying the shape files and reducing the number of properties in the JSON. I also think this JSON is probably very compressible, so delivering it over a bzip stream would be more efficient.
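
ogr2ogr can do that simplification as part of the conversion with its -simplify flag; something like the line below, though the tolerance here is a guess that would need tuning against the real data.

ogr2ogr -f GeoJSON -t_srs "WGS84" -simplify 0.0001 simplified.json CALGIS.ADM_COMMUNITY_DISTRICT.shp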

The full code is available on GitHub at https://github.com/stimms/VectorMapOfCalgary

2013-02-19

Open Data - Countdown to Open Data Day

Only a few more days to go before Open Data Day is upon us. For each of the next few days I’m going to look at a set of data from one of the levels of government and try to get some utility out of it. Some of the data sets which come out of the government have no conceivable use to me. But that’s the glory of open data: somebody else will find these datasets to be more useful than a turbo-charged bread slicer.

Today I’m looking at some of the data the provincial government is making available through The Office of Statistics and Information. This office seems to be about the equivalent of StatsCan for Alberta. They have published a large number of data sets which have been divided into categories of interest such as “Science and Technology”, “Agriculture” and “Construction”. Drilling into the data sets typically gets you a graph of the data and the data used to generate the graph. For instance, looking into Alberta Health statistics about infant mortality gets you to this page.

The Office of Statistics and Information, which I’ll call OSI for the sake of my fingers, seems to have totally missed the point of open data. They have presented data as well as an interpretation of the data as open data. This is a cardinal sin, in my mind. Open data is not about giving people data you’ve already massaged in a CSV file. It is about giving people the raw, source data so that they can draw their own conclusions. Basically, give people the tools to think for themselves; don’t do the thinking for them.

The source data they give doesn’t provide any advantage over the graph; in fact it is probably worse. What should have been given here is an anonymized list of all the births and deaths of infants in Alberta, broken down by date and by hospital. From that I can gather all sorts of other interesting data, such as the following (there is a sketch of the first of these after the list):

  • Percentage of deaths at each hospital
  • Month of the year when there are the most births (always fun for making jokes about February 14th + 9 months)
  • The relative frequency of deaths in the winter compared with those in the summer
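
To make that concrete, here is a sketch of the first calculation against a hypothetical raw record format; the field names and hospitals below are made up for illustration, since OSI publishes nothing like this.

// Hypothetical raw records: one row per birth. This structure is
// invented to show what an open release would enable.
var births = [
  { hospital: "Foothills", date: "2010-11-14", died: false },
  { hospital: "Rockyview", date: "2010-11-15", died: false }
  // ... thousands more rows ...
];

// Percentage of deaths at each hospital
var perHospital = {};
births.forEach(function (b) {
  var h = perHospital[b.hospital] || { births: 0, deaths: 0 };
  h.births += 1;
  if (b.died) { h.deaths += 1; }
  perHospital[b.hospital] = h;
});

Object.keys(perHospital).forEach(function (name) {
  var h = perHospital[name];
  console.log(name + ": " + (100 * h.deaths / h.births).toFixed(1) + "% mortality");
});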

For this particular data set we see reference to zones. What delineates these zones? I went on a quest to find out and eventually came across a map at the Alberta Health page. The map is, of course, a PDF. Without this map I would never have known that Tofield isn’t in the Edmonton zone while the much more distant Kapasiwin is. The reason for this is likely lost in the mists of government bureaucracy. So this brings me to complaint number two: don’t lock data into artificial containers. I should not have to go hunting around to find the definition of zones; they should either be linked off the data page or, better, just not used. Cities are a pretty good container for data of this sort: if the original data had been set up for Calgary, Edmonton, Banff, … then its meaning would have been far more apparent.

Anyway, I promised I would do something with the data. I’m so annoyed by the OSI that this is just going to be a small demonstration. I took the numbers from the data set above and put them into the map, from which I painstakingly removed all the city names.

Infant mortality in Alberta

Obviously there are a million factors which determine infant mortality, but all things being equal you should have your babies in Calgary. You should have them here anyway because Calgary has the highest concentration of awesome in the province. Proof? I live here.

2013-02-18

Data Visualization - A Misleading Visualization

There is a saying which goes something like “you can make up statistics to prove anything; 84% of people know that”. The assertion is that nobody checks the sources of statistics, which is more or less accurate. The lack of fact checking goes double for the recent surge of infographics on the web. I saw one show up on Twitter today which I thought was particularly damning in its misrepresentation of statistics.

A poor visualization

What’s wrong with this? Look at the size of those two circles. The one on the left is shockingly larger than the one on the right. This is done very much on purpose to shock people into thinking that the government is burning through money, that government workers have received a huge salary increase in comparison with the private sector. However the difference isn’t that huge. The ratio between the two should be about 2.38, but if we look at the size of the circles the ratio looks to be closer to 7 or 8.

Small circles inside the large one

A common mistake made with circles is to double the diameter to represent a doubling in size. Unfortunately, doubling the diameter increases the area by a factor of 4, not 2. In this case the ratio is more than doubled, so this isn’t just the common mistake but a purposeful misrepresentation.
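
The honest way to size the circles is to scale the area rather than the diameter: for a value ratio r, the radius should only grow by √r. A quick sketch of the arithmetic using the 2.38 ratio, with an arbitrary example size:

// To draw a circle representing 2.38x the value of another, scale its
// AREA by 2.38, i.e. its radius by sqrt(2.38) ~= 1.54.
var ratio = 2.38;
var smallRadius = 50;                        // arbitrary example, in pixels
var honestRadius = smallRadius * Math.sqrt(ratio);

console.log(honestRadius.toFixed(1));        // ~77.1, not 50 * 2.38 = 119
console.log(Math.pow(honestRadius / smallRadius, 2).toFixed(2)); // 2.38, as it should be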

More than a 2.38 ratio

The moral of the story? While data visualizations can tell a story about data, you, as the consumer of the visualization, need to pay attention to the underlying data and not just a pretty picture.

2013-02-15

Open Data 101

One of the things I’m really enthusiastic about is open data. If you haven’t heard of open data, the idea is that governments have a lot of data at their disposal. They gather this data as a normal part of doing business. If you think about your city, they gather data about traffic patterns so they can set up traffic light patterns and decide which interchanges to expand first. They gather data about property values so they know what tax rates should be. They gather demographics about neighbourhoods to decide where to put recreation centers. Cities also have at their disposal lists of all the street names and all the companies registered in the city. The list of data goes on and on.

Other levels of government are equally well set up with data. You simply cannot run a government, or really any large company, without a lot of data. Typically governments sit on this data; they hoard it and gloat over it in the high towers of their data castles.

The City of Calgary’s data castle. Huge waste of taxpayers’ money if you ask me.

Fortunately this is changing. Governments are starting to open this data up and make it available to the general public. This is fantastic because it allows those of us with some data analysis chops to dig in and find all sorts of correlations which governments might not have noticed. Many eyes can find things which government workers might have missed. Governments are also amazingly slow to react to new technologies, so it isn’t very likely that your government is going to even think about producing a mobile application, much less create one. However private individuals or companies may well see profit or use in creating applications. With open data they can go ahead and do it.

The key to open data is that governments give it away for free and without strings. Some governments are reluctant to give the data away without strings attached. They consider that they have spent a lot of money to create the data, so they should be compensated for it. What they’re missing is that the money they’re spending is our money. We paid for the data so we should be entitled to use it in whatever way we see fit. The applications and tools which the data savvy are creating are not being created by governments. Everybody is likely to benefit from these applications, so that is another way in which governments benefit from open data.

The city in which I live, the City of Calgary, has a small collection of open data available on their website. Unfortunately, they’re still behind the times. The data is protected by a rather draconian user agreement (Edit: Walter was kind enough to point out that the license has been vastly improved; I no longer have any real complaints about it) and the city provides no API access to the data. I am hopeful that the city will catch on soon and update the data they provide.

I’m talking today about open data because February the 23rd is Open Data Day. I am going to spend some time writing something using either Calgary’s open data or something from the provincial or federal governments. I’m not sure what I’ll make, so please feel free to make some suggestions. Calgary has a few transit applications already, so it won’t be one of those. If you’re interested in joining me to create something then drop me a line. Open data is only a success if people like you and me use it. So let’s get using it!

2013-02-14

HTML 5 Visualizations - Talk Notes

If you came to my talk today then thanks! If you didn’t then you should know that I’m writing down your name. What am I going to do with the list of names I build? Probably I’ll sell it to telemarketers or something.

PowerPoint slides: HTML5 data visualizations (don’t bother, there are like 3 slides)

Code: https://github.com/stimms/HTML5Visualizations

The presentation is based on a number of blog entries written earlier this year:

HTML5 Data Visualizations - Part 1 - SVG vs. Canvas

HTML5 Data Visualizations - Part 2 - An Introduction to SVG

HTML5 Data Visualizations - Part 3 - Getting Started with Raphaël

HTML5 Data Visualizations - Part 4 - Creating a component with Raphaël and TypeScript

HTML 5 Data Visualizations - Part 5 - D3.js

HTML 5 Data Visualizations - Part 6 - Visual Jazz

2013-02-14

Presentation Today!

Today I’m doing a talk at the Calgary .net group about HTML 5 data visualizations. If you’re interested in learning a bit about some of the cool, interactive graphics which HTML 5 enables on the browser then I encourage you to come out. You will go away knowing how to build a simple bar chart in a handful of JavaScript and you may even learn something about TypeScript.

The event starts at high noon in downtown Calgary, 800 6th Ave SW (+15 Level across from Spice Cafe):

[Embedded Google Map: 800 6th Ave SW, Calgary]

Bring your lunch and come out; you’ll be done in plenty of time to make it back to work for 1:00. The slides and demos will be posted here once my presentation has started.

2013-02-13

Debugging Android App from OSX

I just got a Nexus 7 for my birthday and thought I would try deploying to it. I really have no idea about building for Android, so there were a couple of stumbling blocks for me which I thought I would write down for future reference.

  1. The Nexus 7 and, I understand, Android devices in general after 4.2 don’t have a development option in the menu. You have to go to Settings > About and then tap on the build number 7 times to enable it. Hilarious. I’m delighted we basically have a Konami code for getting to options.

  2. By default USB debugging is turned off. You need to turn that on from the developer menu for the device to actually be found by the Android SDK. You can check your devices by running

adb devices

adb is found in the platform-tools directory of the SDK.
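
If USB debugging is on and the cable is good, the device shows up with its serial number and state; the output looks something like this (the serial here is made up):

List of devices attached
0123456789abcdef	device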

2013-02-12

Stop With the Error Codes, Already

I don’t like error codes. I suppose that back in the day when computers were young and had very limited memory there was a purpose to error codes. But last I checked, everybody and their mother had enough storage available to show people a message in their language of choosing. I’m put in mind of this today by some random alphanumeric error on Windows Phone. I didn’t see it myself and, honestly, I’m so done with Windows Phone that I wouldn’t care if I did.

Why are we showing errors like this to users? We try to abstract the inner workings of our applications from users so they don’t know what technologies or tools were used to create them. Users don’t care that you used a really novel way of building a dependency injection container or that your development workflow was centered around branches and not feature flags. Nope, they care if things work and if they are easy to use. So look deep into your developer brains and think: which of these is better?

805a0193

or

An error has occurred: the server is currently unresponsive - we’re working on a fix and will be back shortly

Gosh, I sure like that second one. Even more than that error I would like to see something which tells me how I can fix the problem, if it is indeed something I can fix. As an example, I have an application which reads Excel files and loads them into a database. I would say that 30% of the time when I run it there is a failure because I’ve still got the Excel file open. Instead of crashing or showing an unintelligible error message I show

Unable to read Excel file. Typically this is because the file is already open. Close Excel and click to retry.

This error is precise, it gives a tip on how to fix the problem, and it offers a shortcut to the action the user should perform once they have solved it. We should be writing more errors like this. It only takes an extra 5 minutes to code up but can save people a lot of grief in the long run.
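
As a sketch of the pattern, here is a hypothetical Node.js version of that retry flow. It is not the actual application from the post, and the error codes checked are my assumption about how a locked file surfaces on Windows.

// Hypothetical loader illustrating the "precise error + retry" pattern.
var fs = require("fs");

function readSpreadsheet(path, onData) {
  try {
    onData(fs.readFileSync(path));
  } catch (err) {
    // A file still open in Excel typically surfaces as EBUSY or EPERM.
    if (err.code === "EBUSY" || err.code === "EPERM") {
      console.error("Unable to read Excel file. Typically this is because " +
                    "the file is already open. Close Excel and press Enter to retry.");
      process.stdin.once("data", function () { readSpreadsheet(path, onData); });
    } else {
      throw err; // unknown failure: don't hide it behind a friendly message
    }
  }
}

readSpreadsheet("timesheets.xlsx", function (contents) {
  console.log("Read " + contents.length + " bytes");
});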