2013-03-04

TypeScript - Compiling your first program

I have been using TypeScript, the newish Microsoft language which compiles to JavaScript, in a lot of my blog posts lately. It occurred to me that I haven't actually written anything about TypeScript by itself. I'm really a big fan of many of the things TypeScript does, so let's dive right in.

The first thing to know is that TypeScript needs to be compiled by the TypeScript transcompiler. There are a number of ways to get the compiler: you can install it as part of a node.js install using npm, the node package manager, you can download the Visual Studio plugin, or you can compile it from source. Compile it from source? What is this, Linux? Actually the compiler is itself written in TypeScript. I guess this is a pretty common thing to do when you build a compiler, a trick known as bootstrapping: build a minimal compiler and then use that to build a more fully featured one. It boggles my mind, to be perfectly honest.

Anyway, I installed the npm version because I'm doing most of my web development using the Sublime Text editor. The command line is very easy; you can pretty much get away with knowing

./tsc blah.ts

This will produce a JavaScript file called blah.js. I recommend that you keep the default of naming the JavaScript files after the TypeScript files; it will help you keep your sanity. If you really want to rename the output you can do it with the out flag

./tsc blah.ts -out thisisabadidea.js

There are a number of other flags you can use, including a bunch of undocumented flags you'll only find by reading the source. The most useful flag is -w, which watches the .ts file and recompiles it when there are changes. This is useful if you're developing your code and don't want to have to keep dropping to the command line to recompile and then refresh in the browser.
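
For example, to have the compiler sit and recompile blah.ts every time it changes, something like this should do it:

./tsc -w blah.ts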

The first thing you need to know about the TypeScript language itself is that it has a hybrid type system. JavaScript typically performs type checking only at runtime, and by default TypeScript does the same thing, but TypeScript also allows you to perform static type checking by annotating your variables with a type. The syntax is to add : type to a declaration, and these annotations can be used on function parameters as well as on ordinary variable declarations.
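
For instance, here is a minimal sketch using a hypothetical add function whose parameters and return value are annotated as number:

function add(left: number, right: number): number {
    return left + right;
}

var result: number = add(1, 2); // both arguments are numbers, so this compiles cleanly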

If we change the call to this hypothetical function to
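
var result: number = add("1", 2); // the first argument is now a string, not a number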

and then try to compile it, TypeScript will throw an error:

Supplied parameters do not match any signature of call target

This is obviously a very simple example but I have found that TypeScript quickly catches silly typing errors which I would normally have to check for with unit tests.

Tomorrow we’ll look into some more features of TypeScript.

2013-03-01

Force Directed Graph 2

Yesterday I showed how to create a force directed graph with a dataset taken from Wikipedia. As it stood it was pretty nifty but it needed some interaction. We’re going to add some very basic interaction. To do this we’ll use the data- attributes we added when building the graph. These attributes were added as part of HTML5 to allow attaching data to DOM elements. Any attributes which are prefixed with data- are ignored by the browser unless you specifically query for them.

I started by adding a set of buttons to the page, one for each show.

Then I added a simple jQuery function to hook up these buttons and filter the graph
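
A minimal sketch of what that hook-up can look like, assuming each button carries a data-show attribute naming the show and each node circle carries the data-productions attribute from yesterday's post (the button class and the colours are placeholders):

$("button.show-filter").click(function () {
    var show = $(this).data("show");
    // put every node back to its original colour
    $("circle").css("fill", "#1f77b4");
    // then highlight the nodes whose data-productions attribute mentions the chosen show
    $("circle[data-productions*='" + show + "']").css("fill", "#ff7f0e");
});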

First I reset the graph to its original colour, then I select a collection of elements using the data-productions attribute. This is a stunningly inefficient way to select elements, but we have a pretty small set of data with which we’re working so I’m not overly concerned about the efficiency. It could be improved by using a CSS class instead of a data- attribute, as class selectors are well optimized.

The final product is up at http://bl.ocks.org/stimms/5069532

There are a bunch of other fun customizations we could do to this visualization. Some of my ideas are:

  • Improve the display of name labels during hover
  • Highlight links between people when you hover over a node
  • Show the number of collaborations when somebody hovers over a link

That’s the best part of visualization: there is always some crazy new idea to try out. The trick is to know when to stop and what you can take away without ruining the message.

2013-02-28

Force Directed Graphs

In talking with a friend the other day he mentioned that everybody at his work was all agog about a TED talk which had a cool looking graph in it. He promised that he would absolutely, 100%, pinky swear send me a link to the talk so I could try to recreate it. He didn’t.

I had a pretty good idea what it was he was talking about though: a force directed graph. The idea behind a force directed graph is that you have a number of connected nodes which are attached using springs and attracted by gravity. These graphs can be used to show relationships between a number of items and they are interactive so that they can be dragged around to see what the data would look like from a different direction. The proximity of nodes to one another can denote the strength of the relationship.

Let’s try to recreate it using our good friend d3.js. The first thing we need is a set of related data. Wikipedia is a great source for this sort of data. A good data set for a demonstration will have nodes which are connected to more than one other node, and it may have another aspect to it, such as some of the nodes sharing another property. This additional degree of relationship can be denoted with colour.

I took a look at a few pages of data but I’m a nerd so I chose the dataset of actors with whom Joss Whedon has collaborated. If you just clicked on the link to see who Joss Whedon is then you get off this blog, you get off and you never come back.

I started by pulling the table from Wikipedia and then transforming it into JSON. I got some help in doing that from http://jsonlint.com/, which is a great tool for checking and formatting JSON. The file is pretty long but a chunk of it looks like
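
Roughly along these lines, though the names and values here are only meant to illustrate the shape of the data rather than being copied from the actual file:

[
    {
        "name": "Nathan Fillion",
        "productions": ["Firefly", "Serenity", "Dr. Horrible's Sing-Along Blog"],
        "medium": ["Television", "Film", "Web"]
    },
    {
        "name": "Ron Glass",
        "productions": ["Firefly", "Serenity"],
        "medium": ["Television", "Film"]
    }
]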

You may notice that I included the names of the productions and their medium. We’ll see more about this tomorrow when we add filtering to the graph.

Fortunately for us d3 provides some helpers to set up a force graph. I basically stole my entire graph code from Mike Bostock’s page. d3 requires that you set up a list of nodes and edges.
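
A rough sketch of the layout setup, assuming the JSON above has already been massaged into a nodes array and a links array, where each link's value is the number of productions the two people share:

var width = 960, height = 600;

var force = d3.layout.force()
    .nodes(nodes)
    .links(links)
    .charge(-120)        // how strongly nodes repel one another
    .linkDistance(60)    // the resting length of each spring
    .size([width, height])
    .start();

var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);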

Nodes are quite easily set up and are just represented as circles. This is pretty much what we’ve seen before, except that we call force.drag which, if you drill into the example, you’ll see allows for moving the nodes.
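
Something along these lines, with the radius and colour picked arbitrarily; the data-productions attribute is what the filtering in the follow-up post queries:

var node = svg.selectAll(".node")
    .data(nodes)
    .enter().append("circle")
    .attr("class", "node")
    .attr("r", 8)
    .style("fill", "steelblue")
    .attr("data-productions", function (d) { return d.productions; })
    .call(force.drag);   // force.drag is what lets you grab and move the nodes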

The edges have different strengths: the higher the value of a link, the stronger the connection and the closer the nodes sit. I built the links based on the productions shared between the two people. The code for extracting the shared productions is… umm… not pretty. I really don’t know how you would make it prettier other than by changing the underlying data structure. So I guess the lesson here is: pick data structures which work with your requirements.
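
A sketch of the edges, plus the tick handler that positions both the lines and the circles from the previous sketch; the stroke width here simply scales with the link's value (in a real page you would append the lines before the circles so they sit underneath):

var link = svg.selectAll(".link")
    .data(links)
    .enter().append("line")
    .attr("class", "link")
    .style("stroke", "#999")
    .style("stroke-width", function (d) { return Math.sqrt(d.value); });

force.on("tick", function () {
    link.attr("x1", function (d) { return d.source.x; })
        .attr("y1", function (d) { return d.source.y; })
        .attr("x2", function (d) { return d.target.x; })
        .attr("y2", function (d) { return d.target.y; });
    node.attr("cx", function (d) { return d.x; })
        .attr("cy", function (d) { return d.y; });
});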

The resulting graph looks like this:

Graph

If you want to see the interactive version you should pop over to http://bl.ocks.org/stimms/raw/5061669/

2013-02-27

Search Everywhere is the Future

Today I was shown a demo of an in-house application. It was what I have come to think of as a typical in-house application: that is to say, lacking some attention to usability. I see this all the time and I think that not spending time on good usability and attractive design makes people think less of the application. You hear “Oh, that accounting software sucks, it is so hard to use” and eventually the software is replaced sooner than needed and at a higher cost. Anyway, that’s not what this post is about; it is about search and how it is the killer user interface.

In the application I was looking at, let’s give it a random three letter acronym as all in-house software has: TYM, there was a form which required you to select an employee. The employee selection was a drop down box with 2600 entries in it. I asked why the usability on that drop down was so bad. Why weren’t there at least some filters? “Oh,” said the lady demoing it, “you can use hyperscroll.” Hyperscroll?! I hadn’t seen hyperscroll before so I had her demo it. Turns out it is just the search you get when you go into a drop down and type the first letters of the entry. In a list of 2600 entries that approach isn’t great. There is limited space in a drop down and the UI paradigm just doesn’t adapt well to that many entries. It also isn’t good if you’re looking for a last name and the list is sorted by first name.

This is a perfect place to use a search. Search doesn’t have to be big and involved; it can be very simple and small, it can even be an autocomplete box. Heck, you don’t even need to make the search server side! I believe that in an era of lots of data the utility of the drop down, or combo box, is limited. There are still times, when the entries in a drop down number fewer than, say, 10, that a drop down is the correct approach. Of course it is difficult to know when your data will exceed that number of entries, so you might be better off building the search in right off the bat.
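
As a sketch of how little code that takes, here is a purely client-side autocomplete using jQuery UI against a pre-loaded list of employee names (the element id and the array are hypothetical):

var employees = ["Alice Anderson", "Bob Brown", "Carol Chan"]; // load this however you like

$("#employee").autocomplete({
    source: employees,   // jQuery UI filters the list on whatever substring the user types
    minLength: 2
});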

This wasn’t the only place in TYM which would have benefited from a search. There was another form which contained a wall of check boxes. I mocked up this example but TYM was far more extreme.

Checkboxtopia

Again this is a place where search would have been a better solution. The typical approach I take in this situation is to provide a tagging mechanism like the one provided by TagIt. This approach is not as intuitive to users, but once they have the hang of it they will like it far more.
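
As a sketch, assuming a plain list element with an id of tags and a hypothetical set of options, Tag-it boils down to something like:

$("#tags").tagit({
    availableTags: ["payroll", "benefits", "scheduling", "reporting"] // hypothetical options offered as you type
});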

I really think that a few minutes of care and attention in user interface design will win your users over to you.

2013-02-26

AngelaSmith - Creating Test Data

Yesterday I blogged about a rapid data prototyping tool called AngelaSmith which permits building realistic test data. Today I’m going to show you how to make some use of it. The first thing we need is a model on which to operate. Ever since I read the license agreement for Java 1.1, which explicitly prohibited the development of software related to airplanes, I’ve liked example code which references airplanes. I even had a textbook at one point in which all the examples were related to the flying of planes. I can’t remember which book it was but it might have been an old Deitel book.

Anyway let’s start with this model:

This can be filled with AngelaSmith

But if we print out the data we can see that Angela wasn’t quite smart enough to look at substrings of the field names to perform matches. That’s a shame and probably going to be a pull request.

Not great test data

I did have a good laugh at the idea of a pitchfork with a range of 58km. That’s a long pitchfork!

So we’re going to have to set up some custom rules for the data generation

For the most part we can lean on Jen (oh, such witty and confusing naming) to do the population for us, but you can see that I used my own custom generation method for flight numbers because they’re not really standard.

Better generation

Well that was pretty easy! Thanks, Angela, you’re the best.

2013-02-24

Using Realistic Data in Unit Testing

I was writing some unit tests the other day around some code that was missing them and I encountered a couple of field names which confused me. They were well named fields but within the domain I was working in they could have had a couple of different meanings. This isn’t the fault of the programmer; there are times when even the domain experts are a bit muddled about their terminology. Ideally we developers should work with the domain experts to sort out their language and ensure that everybody has a solid understanding of the domain. This isn’t always possible and you might end up with a field like PlantName which could be a couple of things.

I wasn’t keen on changing the field name as the code interfaced with other systems and I really didn’t want to coordinate simultaneous releases with our IT department. What I did, instead, was use the unit tests to clarify the content of the fields. The two meanings of the ambiguously named field had very different looking data in them. So I made sure that the data I put in corresponded with the right meaning for the field.

This isn’t what I typically do. I normally use test data which is a random splat of letters or numbers without meaning. If, however, you take the extra few seconds needed to put in realistic data then you gain a couple of advantages:

  1. Those reading your tests immediately gain a deeper understanding of what it is you’re testing.
  2. You may uncover subtle bugs which you would miss with random data. For instance, consider an application which did word stemming: you’re likely to see a different result if you use the test data “running” (run) instead of “fkjdsklfjadsli”.

While I was thinking about this I stumbled on the blog of my Calgary .NET hero David Paquette. Dave, in conjunction with James Chambers, announced a tool called AngelaSmith, named after the British Labour Party politician who was instrumental in changing the law to afford those attacked by dogs better protection. Dave and James are real activists for the protection of those injured in dog attacks so I can see why they are using the name. Their tool allows for the creation of realistic test data using a variable name based strategy.

What’s that mean? It means that if you have common names in your code like “FirstName” or “LastName” then AngelaSmith will put in realistic values from its own database. AngelaSmith has built in values for a lot of common fields, but if you have something weird, like I did, you can put in a custom population strategy and have AngelaSmith generate values for you.

The project is pretty much brand new but it is a great idea and I hope it is something which James and Dave choose to continue.

I’ll post up some examples of how to use and extend AngelaSmith in a bit but for now the introduction on the GitHub page is sufficient: https://github.com/MisterJames/AngelaSmith

2013-02-22

Open Data Day - The Final Day

I have been talking a lot about open data this week as we draw closer to Open Data Day. Tomorrow I’ll be at the Open Data Day event in Calgary for all the fun and games one might expect from an Open Data Day. In my last post before the event I wanted to talk about a different kind of open data. I started this series by defining open data within the context of government data. There is a great deal of data out there which is paid for by the government and taxpayers but is held secret. I am speaking of the research from universities.

The product of universities is research papers. These papers should be published openly for anybody to read and make use of. Instead they are locked away behind paywalls in expensive journals. The price of these papers is absolutely stunning. For instance the quite interesting paper “Isomorphism Types of Infinite Symmetric Graphs”, which was published in 1972, is $34. “Investigating Concurrency in Online Auctions Through Visualization” is $14. This sort of information is rightfully ours as the taxpayers who funded the university. It was in an attempt to free this data that Aaron Swartz was arrested and hounded to death. He downloaded an, admittedly large, number of papers and republished them.

A fallen hero

There should have been no need for this tragedy. The system which allows this practice to continue should be changed; it should be torn down and destroyed. Peer review is an important part of the scientific process and it is pure hubris to assume that all research is done in universities which are able to afford licenses to journals. Who knows how many people there are out there like Srinivasa Ramanujan who can benefit from open data.

While I abhor the secrecy of journals there is a greater part to this story. The research behind these papers is rarely opened up to review. We see only the summarized results and perhaps a mention of the methodology. Requiring that actual raw statistical and log data be opened as part of the process of peer review will help alleviate fraudulent research and accelerate the application of new discoveries. Push your politicians to link university funding to openness. It is an idea whose time has come.

2013-02-21

Open Data - Countdown to Open Data Day 3

This is day 3 of my countdown to Open Data Day. That sneaky Open Data Day is slithering and sneaking up on us like a greased snake on an ice rink. So far we’ve looked at data from the provincial government and the city. That leaves us just one level of government: the federal government. I actually found that the feds had the best collection of data. Their site data.gc.ca has a huge number of data sets. What’s more is that the government has announced that it will be adopting the same open government system which has been developed jointly by India and the US: http://www.opengovtplatform.org/.

One of the really interesting things you can do with open data is to merge multiple data sets from different sources and pull out conclusions nobody has ever looked at before. That’s what I’ll attempt to do here.

Data.gc.ca has an amazing number of data sets available so if you’re like me and you’re just browsing for something fun to play with then you’re in for a bit of a challenge. I eventually found a couple of data sets related to farming in Canada which looked like they could be fun. The first was a set of data about farm incomes and net worths between 2001 and 2010. The second was a collection of data about yields of various crops in the same time frame.

I started off in Excel summarizing and linking these data sets. I was interested to see if there was a correlation between high grain yields per hectare and an increase in farm revenue. This would be a reasonable assumption as getting more grain per hectare should allow you to sell more and earn more money. Using the power of Excel I merged and cut up the data sets to get this table:

Year    Farm Revenue    Yield Per Hectare    Production in Tonnes
2001    183267          2200                 5864900
2002    211191          1900                 3522400
2003    194331          2600                 6429600
2004    238055          3100                 7571400
2005    218350          3200                 8371400
2006    262838          2900                 7503400
2007    300918          2600                 6076100
2008    381597          3200                 8736200
2009    381250          2800                 7440700
2010    356636          3200                 8201300
2011    480056          3300                 8839600

Looking at this it isn’t apparent if there is a link. We need a graph!

I threw it up against d3.js and produced some code which was very similar to my previous bar chart example in HTML 5 Data Visualizations – Part 5 – D3.js
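
A minimal sketch of that chart: two bars per year, yield per hectare in blue and farm revenue in orange, with the values scaled by eye rather than with proper d3 scales (data here is assumed to be the table above as an array of objects with year, revenue and yield properties):

var svg = d3.select("body").append("svg")
    .attr("width", 800)
    .attr("height", 300);

svg.selectAll(".yield")
    .data(data)
    .enter().append("rect")
    .attr("class", "yield")
    .attr("x", function (d, i) { return i * 70; })
    .attr("width", 30)
    .attr("y", function (d) { return 300 - d.yield / 15; })
    .attr("height", function (d) { return d.yield / 15; })
    .style("fill", "steelblue");

svg.selectAll(".revenue")
    .data(data)
    .enter().append("rect")
    .attr("class", "revenue")
    .attr("x", function (d, i) { return i * 70 + 32; })
    .attr("width", 30)
    .attr("y", function (d) { return 300 - d.revenue / 2000; })
    .attr("height", function (d) { return d.revenue / 2000; })
    .style("fill", "orange");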

Grain yields in blue, farm revenues in orange

I didn’t bother with any scales because it is immediately apparent that there does not seem to be any correlation. Huh. I would have thought the opposite.

You can see a live demo and the code over at http://bl.ocks.org/stimms/5008627

2013-02-20

Open Data - Countdown to Open Data Day 2

Open Data Day draws ever closer, like Alpha Centauri would if we lived in a closed universe during its contraction phase. Today we will be looking at some of the data the City of Calgary produces, in particular the geographic data about the city.

I should pause here and say that I really don’t know what I’m doing with geographic data. I am not a GIS developer so there are very likely better ways to process this data and awesome refinements that I don’t know about. I can say that the process I followed here does work so that’s a good chunk of the battle.

A common topic of conversation in Calgary is “Where do you live?”. The answer is typically the name of a community, to which I nod knowingly even though I have no idea which community is which. One of the data sets from the city is a map of the city divided into community boundaries. I wanted a quick way to look up where communities are. To start I downloaded the shape files, which came as a zip. Unzipping these got me

  • CALGIS.ADM_COMMUNITY_DISTRICT.dbf
  • CALGIS.ADM_COMMUNITY_DISTRICT.prj
  • CALGIS.ADM_COMMUNITY_DISTRICT.shp
  • CALGIS.ADM_COMMUNITY_DISTRICT.shx

It is my understanding that these are ESRI files. I was most interested in the .shp file because I read that it could be transformed into a format known as GeoJSON, which can be read by D3.js. To do this I followed the instructions on Jim Vallandingham’s site and used a tool called ogr2ogr

ogr2ogr -f geoJSON output.json CALGIS.ADM_COMMUNITY_DISTRICT.shp

However this didn’t work properly and when put into the web page produced a giant mess which looked a lot like

Random Mess

I know a lot of people don’t like the layout of roads in Calgary but this seemed ridiculous.

I eventually found out that the shp file I had was in a different coordinate system from what D3.js was expecting. I should really go into more detail about that but, not being a GIS guy, I don’t understand it very well. Fortunately some nice people on StackOverflow came to my rescue and suggested that I instead use

ogr2ogr -f geoJSON output.json -t_srs "WGS84" CALGIS.ADM_COMMUNITY_DISTRICT.shp

This instructs ogr2ogr to reproject the output into World Geodetic System 1984 (WGS84), the latitude and longitude coordinates that D3.js expects.

Again leaning on work by Jim Vallandingham I used d3.js to build the map in an SVG.
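
A minimal sketch of the map code, assuming the reprojected GeoJSON is sitting in output.json; the scale, rotate and translate numbers are placeholders of the sort you end up finding by trial and error:

var width = 800, height = 800;

var projection = d3.geo.mercator()
    .center([-114.06, 51.05])            // roughly downtown Calgary
    .scale(70000)
    .rotate([0, 0])
    .translate([width / 2, height / 2]);

var path = d3.geo.path().projection(projection);

var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height);

d3.json("output.json", function (error, geojson) {
    svg.selectAll("path")
        .data(geojson.features)
        .enter().append("path")
        .attr("d", path)
        .attr("data-name", function (d) { return d.properties.NAME; }); // the property name is a guess
});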

The most confusing line in there is the section with scaling, rotating and translating the map. If these values seem random it is because they are. I spent at least an hour twiddling with them to get them more or less correct. If you look at the final product you’ll notice it isn’t quite straight. I don’t care. Everything else is fairly easy to understand and should look a lot like the d3.js we’ve done before.

Coupled with a little bit of jQuery for selecting matching elements we can build this very simple map. It will take some time to load as the GeoJSON is 3 megabytes in size. This can probably be reduced by simplifying the shape files and reducing the number of properties in the JSON. I also think this JSON is probably very compressible, so delivering it over a bzip stream would be more efficient.
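
One way the jQuery lookup might look, assuming each path carries the data-name attribute from the sketch above and there is a text box with an id of communityName:

$("#communityName").change(function () {
    var name = $(this).val();
    $("path").css("fill", "#cccccc");                          // dim every community
    $("path[data-name='" + name + "']").css("fill", "orange"); // then highlight the match
});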

The full code is available on GitHub at https://github.com/stimms/VectorMapOfCalgary

2013-02-19

Open Data - Countdown to Open Data Day

Only a few more days to go before Open Data Day is upon us. For each of the next few days I’m going to look at a set of data from one of the levels of government and try to get some utility out of it. Some of the data sets which come out of the government have no conceivable use to me. But that’s the glory of open data: somebody else will find these datasets to be more useful than a turbo-charged bread slicer.

Today I’m looking at some of the data the provincial government is making available through The Office of Statistics and Information. This office seems to be about the equivalent of StatsCan for Alberta. They have published a large number of data sets which have been divided into categories of interest such as “Science and Technology”, “Agriculture”, and “Construction”. Drilling into the data sets typically gets you a graph of the data and the data used to generate the graph. For instance, looking into Alberta Health statistics about infant mortality gets you to this page.

The Office of Statistics and Information, which I’ll call OSI for the sake of my fingers, seems to have totally missed the point of OpenData. They have presented data as well as interpretation of the data as OpenData. This is a cardinal sin, in my mind. OpenData is not about giving people data you’ve already massaged in a CSV file. It is about giving people the raw, source data so that they can draw their own conclusions. Basically give people the tools to think by themselves, don’t do the thinking for them.

The source data they give doesn’t provide any advantage over the graph, in fact it is probably worse. What should have been given here is an anonymized list of all the births and deaths of infants in Alberta broken down by date and by hospital. From that I can gather all sorts of other interesting data such as

  • Percentage of deaths at each hospital
  • Month of the year when there are the most births (always fun for making jokes about February 14th + 9 months)
  • The relative frequency of deaths in the winter compared with those in the summer

For this particular data set we see reference to zones. What delineates these zones? I went on a quest to find out and eventually came across a map on the Alberta Health page. The map is, of course, a PDF. Without this map I would never have known that Tofield isn’t in the Edmonton zone while the much more distant Kapasiwin is. The reason for this is likely lost in the mists of government bureaucracy. So this brings me to complaint number two: don’t lock data into artificial containers. I should not have to go hunting around to find the definition of zones; they should either be linked off the data page or, better, just not used. Cities are a pretty good container for data of this sort: if the original data had been set up for Calgary, Edmonton, Banff, and so on then its meaning would have been far more apparent.

Anyway, I promised I would do something with the data. I’m so annoyed by the OSI that this is just going to be a small demonstration. I took the numbers from the data set above and put them into the map from which I painstakingly removed all the city names.

Infant mortality in Alberta

Obviously there are a million factors which determine infant mortality but, all things being equal, you should have your babies in Calgary. You should have them here anyway because Calgary has the highest concentration of awesome in the province. Proof? I live here.