Simon Online

2013-04-05

Shrink geoJSON

A while back I presented a d3.js-based solution for showing a map of Calgary based on the open data provided by the city. One of the issues with the map was that the data was huge, over 3 megabytes. Even with good compression this was a lot of data for a simple map. Today I went hunting for a way to shrink that data.

I started in QGIS, which is a great mapping tool. There I loaded the same shape file I had downloaded from the City of Calgary’s open data website some months back. I went to the Vector menu, then Geometry Tools, and selected Simplify geometries. That presents you with a tolerance to use when removing some of the lines. See, the map is a collection of polygons, but the ones the city gives us are extremely detailed. For a zoomed-out map such as the one we’re creating there is no need for that much detail. The lines can be smoothed.

Smoothing a line

If we repeat this process many thousands of times the geometry becomes much simpler. I played around with various values for this number and finally settled on 20 as a reasonable tolerance.

QGIS then shows the map with the removed vertices highlighted with red Xs. I was alarmed by this at first, but don’t worry: the export won’t contain them.

I then right-clicked on the layer and saved it as a shape file. Next I dropped back to my old friend ogr2ogr to transform the shape file into GeoJSON.

ogr2ogr -f geoJSON data.geojson -t_srs "WGS84" simplified.shp

The result was a JSON file which clocked in at 379K but looked indistinguishable from the original. Not too bad, about an 80% reduction.

I opened up the file in my favorite text editor and found that there was a lot of extra data in the file which wasn’t needed. For instance, the record for the community of Forest Lawn Industrial contained:

"type": "Feature", "properties": { "GEODB_OID": 1069.0, "NAME0": "Industrial", "OBJECTID": 1069.0, "CLASS": "Industrial", "CLASS_CODE": 2.0, "COMM_CODE": "FLI", "NAME": "FOREST LAWN INDUSTRIAL", "SECTOR": "EAST", "SRG": "N/A", "STRUCTURE": "EMPLOYMENT", "GLOBALID": "{0869B38C-E600-11DE-8601-0014C258E143}", "GUID": null, "SHAPE_AREA": 1538900.276872559916228, "SHAPE_LEN": 7472.282043282280029 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -113.947910445225986, 51.03988003011284, 0.0 ], [ -113.947754144125611, 51.03256157080034, 0.0 ], [ -113.956919846879231, 51.032555219649311, 0.0 ], [ -113.956927379966558, 51.018183034486498, 0.0 ], [ -113.957357020247059, 51.018182170100317, 0.0 ], [ -113.959372692834563, 51.018816832914304, 0.0 ], [ -113.959355756999315, 51.026144523792645, 0.0 ], [ -113.961457605689915, 51.026244276425345, 0.0 ], [ -113.964358232775155, 51.027492581858972, 0.0 ], [ -113.964626177178133, 51.029176808875143, 0.0 ], [ -113.968142377996443, 51.029177287922145, 0.0 ], [ -113.964018694707519, 51.031779244814821, 0.0 ], [ -113.965325119316319, 51.032649604496967, 0.0 ], [ -113.965329328092139, 51.039853734199696, 0.0 ], [ -113.947910445225986, 51.03988003011284, 0.0 ] ] ] } }

Most of the properties are surplus to our requirements. I ran a series of regex replaces on the file

s/, "SEC.*"ge/}, "ge/g
s/"GEODB.*"NA/"NA/g
s/,\s/,/g
s/]\s/]/g
s/\[\s/[/g
s/{\s/{/g
s/}\s/}/g

The first two strip the extra properties and the rest strip extra spaces.
This stripped a record down to the following (I added back some new lines for formatting’s sake):

{"type":"Feature","properties":{"NAME":"FOREST LAWN INDUSTRIAL"},"geometry": {"type":"Polygon","coordinates":[[[-113.947910445225986,51.03988003011284,0.0], [-113.947754144125611,51.03256157080034,0.0],[-113.956919846879231, 51.032555219649311,0.0],[-113.956927379966558,51.018183034486498,0.0], [-113.957357020247059,51.018182170100317,0.0],[-113.959372692834563, 51.018816832914304,0.0],[-113.959355756999315,51.026144523792645,0.0], [-113.961457605689915,51.026244276425345,0.0],[-113.964358232775155, 51.027492581858972,0.0],[-113.964626177178133,51.029176808875143,0.0], [-113.968142377996443,51.029177287922145,0.0],[-113.964018694707519, 51.031779244814821,0.0],[-113.965325119316319,51.032649604496967,0.0], [-113.965329328092139,51.039853734199696,0.0],[-113.947910445225986, 51.03988003011284,0.0]]]}}

The whole file now clocks in at 253K. With gzip compression it is only 75K, which is very reasonable and something like 2% of the size of the file we had originally. Result! You can see the new, faster-loading map here.
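For completeness, here is a rough sketch of how the slimmed-down file can be pulled into d3.js. This is not the code from the original map post; it assumes the d3 v3 API of the day and a made-up Mercator projection roughly centred on Calgary, so treat the numbers and selectors as placeholders.

// Sketch only: load the simplified GeoJSON and draw one path per community.
// The projection centre and scale are guesses; tune them for your own map.
var projection = d3.geo.mercator()
    .center([-114.06, 51.05])
    .scale(60000);

var path = d3.geo.path().projection(projection);

d3.json("data.geojson", function (error, json) {
  if (error) { return console.error(error); }

  d3.select("svg").selectAll("path")
      .data(json.features)
    .enter()
    .append("path")
      .attr("d", path)
    .append("title")
      .text(function (d) { return d.properties.NAME; });
});

Because the only property left after the regex pass is NAME, that is all the title tooltip can show, which is exactly what the map needs.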

2013-04-04

pointer-events - Wha?

Every once in a while I come across something in a technology I’ve been using for ages which blows my mind. Today’s is thanks to Interactive Data Visualization for the Web, about which I spoke yesterday. In the discussion about how to set up tooltips there was mention of a CSS property called pointer-events. I had never seen this property before, so I ran out to look it up. As presented in the book, the property can be used to prevent mouse events from firing on a div. It is useful for avoiding mouseout events should a tooltip show up directly under the user’s mouse.

The truth is that it is actually a far more complex property. See, an SVG element is constructed from two parts: the fill is the stuff inside the element, shown in the picture here as the purple portion. There is also a stroke, which is the outer edge of the shape, shown here in black.

The fill (purple) and stroke (black) of an SVG element

By setting pointer-events to fill, mouse events will only fire when the cursor is over the filled portion of an element. Conversely, setting it to stroke will only fire events when the mouse pointer is over the outer border. There are also settings to only fire events for visible elements (visibleStroke, visibleFill) and to allow mouse events to pass through to the element underneath (none). The Mozilla documents go into its behaviour in some depth.
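To make the tooltip case concrete, here is a small d3-flavoured sketch of the trick the book describes. The class name, data fields and pixel offsets are invented for illustration; the important line is the pointer-events style on the tooltip div.

// Sketch: a tooltip that the mouse cannot interact with, so it never
// steals a mouseout from the element underneath it.
var tooltip = d3.select("body").append("div")
    .attr("class", "tooltip")
    .style("position", "absolute")
    .style("display", "none")
    .style("pointer-events", "none"); // mouse events pass straight through

d3.selectAll("circle")
    .on("mouseover", function (d) {
      tooltip.style("left", (d3.event.pageX + 10) + "px")
          .style("top", (d3.event.pageY + 10) + "px")
          .text(d.value)
          .style("display", "block");
    })
    .on("mouseout", function () {
      tooltip.style("display", "none");
    });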

The ability to hide elements from the mouse pointer is powerful and can be used to improve the user’s interaction with mouse events. I had originally recommended using an invisible layer for this but pointer-events is much cleaner.

2013-04-03

Review - Interactive Data Visualization for the Web

I’ve been reading Scott Murray’s excellent book Interactive Data Visualization for the Web this week. Actually, I’ve been reading a lot of O’Reilly books as of late because they keep offering to sell them to me at a huge discount. Ostensibly the book is about d3.js, which is my personal favorite data visualization library for JavaScript. It is a pretty well thought out book and I would recommend it to those looking to explore d3.js. I don’t know that it is really about interactive data visualizations so much as it is about one specific technology for creating visualizations. However, if we ignore the title then the contents stand by themselves.

I discovered the book because I came across Scott’s blog when I was doing a spike on d3.js some weeks back. His blog was expanded into the book. The book starts with a look at why we build data visualizations, offers some alternative toolkits, then jumps right into d3.js. I would have liked to see more of a discussion around the technologies available in HTML5 for visualizing data. In addition to SVG, which d3.js leverages, there is canvas, and you can also build some pretty interesting things in pure CSS3. There are also many tools for doing static image generation on the server side.

A significant section of the book is dedicated to teaching JavaScript and presenting web fundamentals. I wasn’t impressed that so much effort had gone into a topic which is covered so well by many other books. I’m sure it was just that I’m not the target audience of this section. By chapter 5, though, things are getting interesting.

Scott introduces d3 in more detail and talks about method chaining (a huge part of d3) and getting data. The rest of the book builds on the basic d3.js knowledge by creating more and more complicated graphs. The book moves through bar charts and scatter plots before adding talk of using scales and leveraging animation. I had been a bit confused about how to make use of dynamic data sets in d3, but the section on how to add data cleared that up nicely.

I think the real key to this book is the chapter on interactions. Anybody can draw a graph server side; the story for creating it in JavaScript becomes much more compelling when users can click on items in the graph and have things happen. There is a pretty extensive discussion about how to add tooltips to your visualizations. I have to admit I was a bit miffed by that because I was going to do a blog series along the same lines, and now I’ll just look like an idea-stealing baboon instead of an insightful orangutan.

Finally, a couple of more advanced topics are covered, including some (not all, mind you) of the built-in layouts in d3.js, and then mapping. Thank goodness there is some discussion of projections, because that is what got me when I worked with maps in d3.js.

There is very little discussion about what makes a good visualization and there is no attempt to come up with any unique visualizations. If you’re interested in that aspect, then pretty much anything which comes out of Stephen Few is insightful and super interesting.

For the price that O’Reilly charged me for this book it is 100% worth it. Plus I hear that every time you look at an O’Reilly book and don’t buy it, they kill one of the animals pictured on the cover.

Book Cover

2013-04-02

Importing a git Repository into TFS

From time to time there is a need to replace a good technology with a not so good technology. The typical reason is a business one. I don’t claim to have a deep understanding of how businesses work, but if you find yourself in a situation where you need to replace your best friend, your amigo, your confidant Git with the womanizing, drunken lout that is TFS, then this post is for you! This post describes how to import a git repo into TFS and preserve most of the history.

The first thing you’ll need is a tool called Git-TF, which can be found on CodePlex. It comes as a zip file and you can unzip it anywhere. Next you’ll need to add the unzipped directory to your path. If you’re just doing this as a one-time operation then you can use the PowerShell command:

$env:Path += ";C:tempgit-tf"

to add to your path.

Now that you have that, git should be able to find a whole set of new subcommands. You can check that it is working by running

git tf

You should get a list of subcommands you can run

Git-TF subcommands

Now drop into the git repository you want to push to TFS and enter

git tf configure http://tfsserver:8080/tfs $/Scanner/Main

Where http://tfsserver:8080/tfs is the collection path for your TFS server and $/Scanner/Main is the server path to which you’re pushing. This will modify your .git/config file and add the following:

[git-tf "server"]
    collection = http://tfsserver:8080/tfs
    serverpath = $/Scanner/Main

Your git repository now knows a bit about TFS. All you need to do now is push your git code up and that can be done using

git tf checkin --deep

This will push all the commits on the mainline of your git repo up into TFS. Without the --deep flag only the latest commit will be submitted.

There are a couple of gotchas around branching. You may get this error:

git-tf: cannot check in - commit 70350fb has multiple parents, please rebase to form a linear history or use --shallow or --autosquash

You can flatten your branches either by rebasing or by passing git-tf the --autosquash flag, which will attempt to flatten the branching structure automatically. I’m told that autosquashing can consume a lot of memory if there are a lot of commits in the repository. I have not had any issues, but my repositories are small and my machine has 16GB of memory.

Now you have moved all your source code over to TFS. Yay.

I’m not going to point out that if you keep git-tf around you can continue to work as if you have git and just push commits to TFS. That would likely be against your company’s policies.

2013-04-01

On Hiding Data

I had a very interesting issue submitted over the weekend; one of several which boiled down to the same root cause. The application against which it was submitted has a number of categories. Some of the categories are empty, which is just fine in the application. We made the mistake of underestimating our users and started to hide empty categories. We assumed that when filtering, users wouldn’t want to see categories which would give them empty results. Why would you want to see empty categories? Well, as it turns out, there are a ton of use cases:

  • When adding a new category it will be empty by default and users would like to see that their addition of a category worked
  • When generating reports users want to have proof that the category’s contents are missing
  • When listing categories users have an expectation that none will be missed

We hid the categories to reduce options in a drop-down and to generally clean up the UI. Now that I’ve had a number of issues on this topic submitted by users, we’re going to remove the empty category filters. We’re actually hurting our users by trying to guess at what they want instead of asking them and letting them try things out. Lessons learned in software development, I suppose.

2013-03-29

Select your JSON

In yesterday’s post I talked about how to change out the JSON serializer in ASP.net MVC. That was the first step to serializing an NHibernate model out to JSON. The next issue I came across was that the serialization was really slow. What’s the deal with that? I looked at the JSON which was produced by the serialization and found that a single record was 13KiB! Even a modest result from this action was over 250KiB. That is a pretty significant amount of data to serialize, and there was really no need for it: in the end the data was being used to populate a couple of columns in a table.

The reason the data was so huge is that the NHibernate object which was being serialized was fully populated with data from several tables. There was even a collection or two of records in there.

Projections to the rescue!

The beauty of returning JSON is that it isn’t strongly typed. This means that all you need to do to trim down the traffic is to use object builder notation to project from the collection into a new object. To do this you need only do something like

results.Select(x => new { ID = x.ID, Client = x.Client.Name, UserName = x.User.Name, Name = x.Name, Date = x.Date })

In my case this reduced the JSON payload from over 250KiB to 2.5KiB. This reduced not only the serialization time but also the time taken to send the traffic over the network.
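In context, the controller action ends up looking something like the sketch below. The entity names, the NHibernate session, and the query are all invented for illustration; the point is projecting to an anonymous type before serializing.

// Hypothetical controller: Scan and its Client/User associations stand in
// for whatever the real entities were.
using System.Linq;
using System.Web.Mvc;
using NHibernate;
using NHibernate.Linq;

public class ScansController : Controller
{
    private readonly ISession session; // assume this gets injected somehow

    public ScansController(ISession session)
    {
        this.session = session;
    }

    public JsonResult Index()
    {
        // Project down to just the columns the table needs before anything
        // gets serialized.
        var slim = session.Query<Scan>()
            .Select(x => new
            {
                ID = x.ID,
                Client = x.Client.Name,
                UserName = x.User.Name,
                Name = x.Name,
                Date = x.Date
            })
            .ToList();

        return Json(slim, JsonRequestBehavior.AllowGet);
    }
}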

This is a really easy optimization and it is something that tends to slip the mind. This is why it’s important to test with large datasets.

2013-03-28

Changing the JSON Serializer in ASP.net MVC

Today I stumbled onto some code which was serializing an NHibernate model to JSON and returning it to the client. The problem with directly serializing the object from NHibernate is that it may very well contain loops. This is fine in a C# object graph because only references are stored. However, JSON doesn’t have the ability to store references, so a serializer attempting to serialize a complex C# object must explore the whole object graph. The loops would make the serialized object infinitely large.

The code in question returned a ContentResult instead of a JSON result. I didn’t like that; why not just override the Json method? So I went to override the call and found, to my chagrin, that the base class, Controller, did not declare all the signatures as virtual. Bummer. I hate this sort of thing. Now serialization becomes something of which developers need to be aware. There are a couple of methods you can override, but that’s not a great solution because unless developers know about the custom serializer they won’t know which methods they can and cannot call. I’m not sure what the motivation is for not providing a place to plug in a custom serializer; it seems to me to be a pretty common extension point. The ASP.net MVC team are pretty good at what they do, so I imagine there is a reason.

The best solution I could come up with in short order was to override the two signatures I could. To do that I wrote

https://gist.github.com/anonymous/5268866

I also created a NewtonSoftJsonResult which extended the normal JsonResult. This class provides the serialization implementation. In our case it looks like

https://gist.github.com/anonymous/5268878
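For those who don’t want to click through, here is a rough sketch of what such a result class might look like. This is not the code from the gists above; it assumes Json.NET and simply ignores reference loops, which is one way of coping with the NHibernate object graphs mentioned earlier.

// Sketch only: a JsonResult that serializes with Json.NET instead of the
// built-in serializer. It skips the usual deny-GET check for brevity.
using System.Web.Mvc;
using Newtonsoft.Json;

public class NewtonSoftJsonResult : JsonResult
{
    public override void ExecuteResult(ControllerContext context)
    {
        var response = context.HttpContext.Response;
        response.ContentType = string.IsNullOrEmpty(ContentType) ? "application/json" : ContentType;
        if (ContentEncoding != null)
        {
            response.ContentEncoding = ContentEncoding;
        }

        if (Data != null)
        {
            var settings = new JsonSerializerSettings
            {
                // Stop the serializer chasing the loops in the object graph.
                ReferenceLoopHandling = ReferenceLoopHandling.Ignore
            };
            response.Write(JsonConvert.SerializeObject(Data, settings));
        }
    }
}

The virtual Json overloads on the controller can then simply hand back this result instead of the default one, roughly like so:

protected override JsonResult Json(object data, string contentType, System.Text.Encoding contentEncoding, JsonRequestBehavior behavior)
{
    return new NewtonSoftJsonResult
    {
        Data = data,
        ContentType = contentType,
        ContentEncoding = contentEncoding,
        JsonRequestBehavior = behavior
    };
}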

I still don’t like this solution but it does work. I’m going to think on this one.

2013-03-27

Mobile is Getting Cool

I’m way behind the times on mobile development. It is one of those things I’ve been meaning to get into but just haven’t had the cycles for. There is a mobile app on my horizon so I’ve been looking a bit into how I would go about it. From what I can see there are two good options for building mobile applications at the moment. I should stop and explain my criteria before somebody shivs me for jumping to conclusions.

First is that it needs to run on as many devices as possible. Realistically an Android tablet and an iPad would be sufficient (sorry, Surface). This is obviously a pretty common use case these days. The tablet market has been pretty much owned by the iPad, but recently Android has been making inroads. I have a Nexus 7, which is really a fantastic device. Of course I’m comparing it with a first-generation iPad, so things on the Apple side might have come along some way.

Second is that it needs to be easy to develop for. I’ve never been very impressed with Objective-C, and developing for the Dalvik VM using Java is grim. I’m really sad about where Java’s going as of late. There are Java security flaws all over the web and the auto-update mechanism installs highly questionable software along with the updates. I still think that the JVM is a good platform; I just think that Java has lost its way.

With these criteria in mind I think the two great options for developing for mobile are Mono and PhoneGap. Mono started off as a port of the .net framework to Linux but has now really got legs. You can now write the majority of your application in C# and have it run on Android or iOS. PhoneGap allows you to write your application in HTML5 but still have access to most of the underlying device’s hardware. For the things that don’t have an API yet you can write bindings, but they have to be done natively and, obviously, they need to be written for every platform.

In my mind both PhoneGap and Mono provide powerful programming tools and models for phone development. There are certainly some things for which you might want to write a native app, but I bet they make up less than 5% of the apps on phones. If I had to develop an app today without being given any opportunity for further research I think I would probably jump into PhoneGap. I’m really excited about all the good work going on in the HTML/JavaScript space at the moment. I think the ecosystem for HTML-based phone applications will greatly benefit from the larger world of HTML and JavaScript development.

The mobile space is getting really interesting, far more interesting than it was even two years ago. I bet it will be even more exciting in two years.

2013-03-26

On Being a Generalist

I blog about quite a few subjects. They are mostly software and technology related, but from time to time I talk about business or why you absolutely have to put orange zest in cranberry orange muffins. Within the technology camp I talk about all sorts of things because I consider myself to be a generalist. I don’t want to be stuck using any one technology because I worry that something will happen to that technology and I’ll be unable to find a new job. Then I’ll be unable to work and things will just spiral down until I’m the star of some show on TLC: “Tech Weenie to Crack Dealer” or “Early Adopter to Laggard: Victims of Rogers’ Diffusion Model”.

Sure, there are some “safe” technologies like Oracle or SAP which are unlikely to disappear, but you never know… I think about Silverlight, which was the future for many years until HTML5 pretty much killed it.

I think that being a generalist is a great move. I work with all sorts of technologies and it affords me the ability to apply ideas from one technology to another. I think I come up with some interesting solutions in a problem space because I’ve seen how things are done by people in an unrelated space. I am constantly exposed to things outside of my comfort zone, so I’m always learning. However, it has its costs: I probably don’t earn anywhere near what I could as a deep specialist in a field. Because I lack really deep knowledge I don’t have the opportunities to travel to conferences or give training as an expert. Basically I’m trading being a super-massive star with a short lifespan for being an average main sequence star with a long lifespan. I won’t shine as bright, but I’m also probably not going to be on TLC.

2013-03-25

Debugging my Wireless

There has always been a room in my house which doesn’t get good WI-FI coverage. It’s odd because my house isn’t that big and a single WI-FI point should cover it easily. I’ve largely just been living with it, but the other week I decided enough was enough. I found a great little app for my Nexus 7 which did WI-FI analytics.

WI-FI

It is a great program and I used it to confirm that, yes, my one room is almost devoid of signal from the two WI-FI points I have. I could also use this app to find WI-FI channels which have more or less traffic on them (channel 11 is pretty crummy near me). As it was, I just used the signal meter, which shows the strength of a single WI-FI point, and walked around the house. There is literally only one dead spot in the entire house: that one bedroom.

I think it might be because there is a giant blob of metal in the room in the form of a bed. I tried moving the routers around the house, raising them up, lowering them down, and changing channels, and the signal in this bedroom was always grim. So I bought the biggest, most powerful WI-FI point I could find and stuck it in that room.

Problem solved.

WI-FI sucks.