Simon Online

2013

2013-12-09

Speeding up page loading - part 2

In part 1 of this series I talked about speeding up page loading by combining and minimizing CSS and JavaScript files. Amongst the other things loaded by a web page are images. In fact images can be one of the largest items to load. There are a number of strategies for dealing with large or numerous images. I’ll start with the most radical and move to the least radical.

First approach is to get rid of some or all your images. While it seems crazy because you spent thousands on graphically designing your site it might be that you can duplicate the look without images. Removing even a couple of images can speed up page loading significantly. Instead of images you have a couple of options. You can do without images at all or you can replace the images with lower bandwidth options such as an icon font or pure CSS.

Icon fonts are all the rage at the moment. Downloading custom fonts in the browser is nothing new but a couple of years ago somebody realized that instead of having letters make up the font glyphs you could just as easily have icons. Thus fonts like fontawesomewere born. Fonts are a fairly bandwidth efficient method of providing small images for your site. They are vector based thus far smaller and more compressible than raster fonts. The problem is that an icon font might have hundreds of icons in it which you’re not using. This can be addressed by building a custom icon font. If you’re just interested in font-awesome then icnfntis the tool for doing that.

Alternately you can build your own images using pure CSS. The logo for one of the sites I run is pure CSS. At first it was kind of a geek thing to do to prove we could create a logo that way but having it in CSS actually allowed us to scale it infinitely and reduced the size of the page payload. It is a bit of an adventure in CSS to build anything complicated but worthwhile. If your image it too complicated for CSS then perhaps SVG is the tool for you. Scalable vector graphics can be directly included in the markup for your site so require no additional requests and are highly compressible.

Some images aren’t good candidates for vector images. For these there are fewer options. The first step is to play around with the image compression to see if you can trade some quality for image size. Try different compression algorithms like gif, and png then play with the quality settings. This is a bit of a rabbit hole, you can spend days on this. Eventually you end up kidnapping people off the street, strapping them into a chair and asking them “Which one is better A or B, B or C, A or C?”. This is okay if you’re an optician but for regular people it usually results in jail.

Once the image is optimized you can actually embed the image into your CSS. This is a brand new thing for me which my mentor told me about. In your CSS you can put a base 64 encoded version of your image. Manually this can be done at this cool websitebut more reasonably there is an awesome grunt task for processing CSS.

If you have a series of smaller images then you might consider using CSS sprites. With sprites you combine all your images into a larger image and then use CSS to display only portions of the larger image at a time.

Personally I think this is a huge amount of work with the same results achieved through embedding the images in the CSS. I wouldn’t bother, but you might want to give it a try as combining the images can result in a smaller image overall due to encoding efficiencies.

I was pretty impressed with how much image optimization helped our use of data. We saved about 100Kib which doesn’t sound like a lot but on slow connections it is a lifetime. It is also a lot of data in aggregate.

So far we’ve been concentrating on reducing the amount of data sent and received. In the next part we’ll look at some activities on the server side to reduce the time spent building the response.

2013-12-04

Speeding up page loading - part 1

I started to take a look at a bug in my software last week which was basically “stuff is slow”. Digging more I found that the issue was that pages, especially the dashboard were loading very slowly. This is a difficult problem to definitively solve because page loading speed is somewhat subjective.

We don’t have any specifications about how quickly a page needs to load on the site. Less than 5 seconds? Less than 2 seconds? Such a thing is difficult to define because all too often we fail to define for whom the page loading should be quick. Loading is governed by any number of factors

time taken to build the HTML for the view (excuse the MVC style language ““ the same token replacement needs to be done on most frameworks)
speed of the server
speed of the connection from the server to the client
bandwidth between the client and the server
speed of the client to render the HTML
“¦

The list is pretty daunting so I thought I would write about what I did to improve the speed of my application.

The application is a pretty standard ASP.net MVC application with minimal fron end scripting(at least compared with some applications). This means that the steps I take here are pretty much globally applicable to any ASP.net MVC website and many of them are applicable to any website. My strategy was to pick off the low hanging fruit first, fixing easy to fix problems and those which had a big impact on the speed. This would give some breathing room to get time to fix the harder problems.

This post became quite long so I’ve split it into a number of parts.

Bundling CSS/JS
1. Removing images
2. Reducing Queries
3. Speeding Queries

I’ll post the later parts as I finish writing them.

Loading Resources

A web page is made up of a number of components each of which has to be retrieved from a server and delivered to a client. You start with the HTML and as the client parses the HTML it issues additional requests to retrieve resources such as pictures, CSS and scripts. There is a whole lot of optimization which can be done at this stage and it is where I started on this website.

I started on the slowest loading page: the dashboard. We’re not live yet but it is embarrassing that on our rather limited testing data we’re seeing page load times on the order of 15 seconds. It should never have got this far out of control. Performance is a feature and we should have been checking performance as we built the page. Never mind, we’ll jump on this now.

My tool of choice for this is normally Google Chrome but I thought I might give IE11”²s F12 tools a try. A lot of effort has been put in my Microsoft to improve Internet Explorer in the past few years and IE11 is really quite good. I have actually found myself in a position where I’m defending how good IE is now to other developers. I never imagined I would be in this position a couple of years ago, but I digress.

You can get access to the developer tools by hitting F12 and pressing the play button then reloading the page. This will result in something like this:

This is actually the screen after some significant optimization. If you zoom in on this picture then you can see that this page is made up of 9 requests

1 HTML
3 CSS
2 Scripts
3 Fonts

Originally the page had several more script files and several images taking the total to something like 15. We want to attempt to minimize the number of files which make up a page as there is a pretty hefty overhead associated with setting up a new connection to the server. For each file type there is a strategy for reducing the number of requests.

HTML is pretty much mandatory. CSS files can be concatenated together to form a single file. Depending on how your CSS is constructed and how diligent you’ve been about avoiding reusing identifiers for different things across the site this step can be easy or painfully difficult. On this website I made use of the CSS bundling tools built into ASP.net. I believe that the templates for new ASP.net projects include bundling by default but if you’re working on an existing project it can be added by creating the bundles like so

You’ll note that I’m registering the bundles twice, this is just to demonstrate that you can include either individual files or a whole directory. Then call out to this in the Global.asax.cs’s application start

You can now replace all your inclusions of CSS with a single request to ~/bundles/Style (don’t worry about that ~/ razor will correctly interpret that for you and point it at the site root). If you look at the CSS file hosted there you’ll see that it is a combined and whitespace-stripped file. This minimization will save you some bandwidth and is an added benefit to bundling.

JavaScript files can be bundled in much the same way. If you’ve been smart and namespaced your JavaScript into modules then combining JavaScript should be a sinch. Otherwise you might want to look into how to structure your JavaScript.Bundling the script files is much the same as the CSS

The script bundle will concatenate all your script files together and also minify) them. This saves not only on bandwidth but also on the number of connections which need to be opened.

Reducing the number of requests is a pretty small improvement but it is also pretty simple to do. In the next part we’ll look at removing images to speed up page loading.

2013-12-02

Content-Disposition comes to azure

The Azure team accepts requests for new features on their user voice voice page. I have spent an awful lot of votes on this request “Allow Content-Disposition http header on blobs“. Well now it has been released!

Why am I so excited about this? Well when I put files up into storage in order to avoid name conflicts I typically use a random file name such as a GUID and then store that GUID somewhere so I can easily look up the file and access it. The problem arises when I try to let people directly download file from blob storage, they get a file which is named as a random string of characters. That isn’t very user friendly. Directly accessing blob storage lets me offload the work from my web servers and onto any number of storage servers. So I don’t want to abandon that either. Content disposition lets me hijack the name of the file which is downloaded.

To make use of the new header is actually very simple. I updated the save to blob storage method in the project to take an optional file name which I push into the content disposition

Now when I link people directly to the blob they get a sensibly named file.To do this you’ll need the latest Azure storage dll(3.0).One note of caution is that as of writing the storage emulator hasn’t been updated and will throw some odd errors if you attempt to use the new storage dll against it. Apparently it will all be fixed in the next release of the emulator.

Setting the content disposition header on the blob ensures that everybody who downloads it gets the renamed file. It is also possible to set the header using the shared access signature (SAS) so that you can modify the name of the document for each download. Although, I’ll be honest, I could not find a way of doing this from the managed storage library. I can only find examples using the REST API.

2013-11-23

What makes a senior developer

LinkedIn was kind enough to send me an email with some suggestions about people I might know. In that collection was a young fellow with whom I interact infrequently. He graduated from university in 2009 at which point he started working for the company he remains with to this day. About a year and a half after he stated with the company he became a senior developer.

So this fellow 18 months out of school, who has worked with one company on one project in one language is a senior developer.

Oh. My.

If this fellow works for another 40 years I’m not sure what title he will end up with Ultra-super Megatron Developer? 8th Degree Developer Black Belt? Or something truly silly like Archtect?

The real issue, though, is that companies pay people by title. I would guess that this fellow deserved a raise and that to get that raise his manager had to bump his title. The whole system devalues the concept of experience which is very important.

As an industry we still has bent quite figured out the career path for people who like to program. We shove them into management roles because that is what we have always done with other disciplines. There are countless blogs and articles about that problem. By moving experienced developers to manager roles we’re losing years of great experience and young developers have to relearn the lessons of the past.

We are never going to be able to change business titles, there is too much momentum behind job titles. We need to borrow an idea from “The Naming of Cats”, the T S Eliot poem. Each cat has 3 different names one of which is the cat’s secret name, the name which no human will ever know. Equally developers need to have names that business doesn’t know. I’m reminded of those Geek Codes from the days of slashdot. These were a way of identifying just what sort of a geek you were.

We should have a way of talking about our abilities and skills which is distinct from job titles. This is somewhat similar to the ideas of software craftsmanship which Uncle Bob uses. I think that the craftsmanship movement is a bit too narrow and focused on complying with one way of thinking. So I would suggest that a senior developer should have

Worked for a number of companies
Developed on a number of different platforms
Worked with several different programming paradigms (OO, procedural, functional,”¦)
Shipped new software
Supported existing software
Improved the culture at a company (introduced source control, introduced builds, moved a team to agile,”¦)
An understanding of scale, databases and caching

In addition a senior software developer should be able to have reasonable discussions about almost anything in computers. They should have strong opinions on most technology and they should be willing to change these ideas when faced with new and better ones. A senior developer should watch emerging trends, understand them and take advantage of them.

In short a senior developer should be awesome. I’ve only known a handful of people who are sufficiently awesome to be senior software developers. I wouldn’t count myself among them, but I’m working on it. You should too.

2013-10-10

2 Days with ScriptCS

A couple of days ago I found myself in need of doing some programming but without a copy of Visual Studio. There are, obviously, a million options for free programming tools and environments. I figured I would give ScriptCS a try.

If you haven’t heard of ScriptCS I wouldn’t be surprised. I don’t get the impression that it has a huge following. It is basically a project which makes use of the Roslyn C# compiler to build your shell script like C# into binaries and execute them. It provides a REPL environment for C#. On the surface that seems like a pretty swell thing. My experience was mixed.

The Good

Having the syntax around from C# made my life much easier. I didn’t have to look up any syntaxes for loops or the whatnot which always seems to catch me when I script something in a more traditional scripting language. It was also fantastic to have access to the full CLR. I could do things quickly which would have been a huge pain to figure out in other scripting languages like access a database.

You can also directly install nuget packages using scriptcs simply by running

scriptcs.exe -install DocX

It will download the package, create or update a packages.cs and the scripts automatically find the packages without having to explicitly include the libraries. I was able to throw together a tool which manipulated word documents with relative ease.

I had a lot of fun not having access to Intellisense. At first I was concerned that I really didn’t know how to programme at all and that everything I did was just leaning on a good IDE. After an hour or so I was only slightly less productive that I would have been with Visual Studio. I used sublime as my editor and threw it into C# mode which gave me syntax highlighting. Later I discovered a sublime plugin which provided C# completion! Woo, if not for coderush sublime could have replaced Visual Studio outright.

It was easy to define classes within my scripts something I find cumbersome in some other scripting languages. The ability to properly encapsulate functionality was a joy.

I didn’t try it but I bet you would have no problem integrating unit tests into your scripts which puts you a huge step up on bash”¦ is there a unit testing framework for bash? (Yep there is:https://github.com/spbnick/epoxy).

The Not So Good

Of course nothing is perfect. I used scriptCS as one would use bash. I would write a chunk of code then run the script, check the output and then add more code. Problem is that scriptcs is SLOW to start up. Like kicking off a fully fledged compiler slow. This wouldn’t have been too bad except that for some reason every time I ran a script it would lock the output file.

C:tempscriptcs> scriptcs.exe .test1.csx ERROR: The process cannot access the file ‘C:tempscriptcsbintest1.dll’ because it is being used by another process. C:tempscriptcs> del bin* C:tempscriptcs> scriptcs.exe .test1.csx hi C:tempscriptcs>

I wanted to stab the stupid thing in the face after 5 minutes. I opened up process explorer to see if I could see what was locking the file. As it turns out: nothing was using the file. I don’t know if this is a bug in windows or in scriptcs. In either case it is annoying. I discovered that you can pass the -inMemory flag which avoids the file locking issue by not writing out a file. I guess this will become the default in the future but it brings me to:

The documentation isn’t so hot either. I get that, it is a new project and nobody wants to be slowed down by having to write documentation. However I couldn’t even find documentation on what the flags are for scriptcs. When I went to find out how to use command line arguments I could only find an inconclusive series of discussions on a bug.

The Bad

There were a couple of things which were so serious they would stop me from running ScriptCS for anything important. The first was script polution. If you have two scripts in a folder when running the second one you’ll get the output from the first script. Yikes! So let’s say you have

delete-everything.csx send-reports-to-boss.csx

running

scriptcs send-reports-to-boss.csx

will run delete-everything.csx. Ooops. (Already documented ashttps://github.com/scriptcs/scriptcs/issues/475)

I also ran into a show stopping issue with using generic collections which I further documented here:https://github.com/scriptcs/scriptcs/issues/483

The final show stopper for me making more use of ScriptCS is that command line argument passing hasn’t really been figured out yet. See the issue is that passing arguments normally passes them to ScriptCS instead of the script

scriptcs.exe .test1.csx -increase-awesome=true

the solution seems to be that you have to add “” to tell scriptcs to use that argument for the script

scriptcs.exe .test1.csx – -increase-awesome=true

However some version of powershell hate that. The issue is well documented inhttps://github.com/scriptcs/scriptcs/issues/474

Am I going to Keep Using It?

Well it doesn’t look like I’ll be getting a full environment any time soon in this job. As such I will likely keep up with ScriptCS. I hate not having a real environment because it means I can’t contribute back very well. Although my discovery of C# completion in Sublime might change my mind”¦

If scriptcs worked on mono(it might, I don’t know) and if there was a flag to generate executables from scripts I would be all over it. It is still early in the project and there is a lot of potential. I’ll be keeping an eye on the project even if I don’t continue to use it.

2013-10-03

What's this? - JavaScript Context

I have a little 2 year old son who’s is constantly pointing at things and asking “What’s this?”. I’m a man of the world so I can almost always tell him what it is at which he’s pointing. “That’s a sink”, “That’s the front door”, “That’s an experimental faster than light quantum teleportation device I’m using to travel through time”. “That’s the huddled masses quivering in fear of my mastery over space time”.

The one time I have trouble is when he points at a JavaScript function.

“What’s this?”

Well, son, that depends. One of the weirdities of JavaScript is that the context in which a method is written is not, necessarily, the one in which it is run. Let’s try that for a second. Here is some simple code which demonstrates two different ways to call a function

The console here shows

What is happening is that the first function is being called normally and the second is being called with a different context. In this case an evil context. The discussion of this oddity could end right here if nobody made use of call() or apply(). However people do make use of it, in fact jQuery makes very heavy use of it which, frankly, makes my life miserable.

Whenever you attach an action listener using jQuery the function runs in the context of the item to which the listener is attached.

So the function here will be in the context of whatever.button-thing was clicked. I can see why this was done and it does make things very easy to code having the click context right there. However it makes for trouble when you’re trying to follow proper namespaced JavaScript rules.Let’s set up a demonstration here:

If we run this, as shown athttp://jsfiddle.net/Aj2Sc/4/, then you can see we’re not able to retrieve the value correctly because the context is not the same on in which listener is declared.

Fortunately there are workarounds for this sort of behaviour. The most common is to declare a variable which will hold a temporary copy of the correct this. Typically we call this variable one of {that, self, me} all of which are terrible names in my mind. We can replace the init function in our example with this one

This rather interesting structure creates what is, in effect, a temporary context preserving class. It does this by passing the newly created self into a function which returns a function scoped to self. In effect it is a proxy.

If you think this hack is unappealing you’re right. It is, unfortunately, the way that JavaScript works and you can’t get around it. What you can do is pretty the whole thing up by using the jQuery function proxy like so:

This function does all the same nastyness we did ourselves but at least it wraps it and makes the code readable.

2013-10-01

I do not get browser statistics

I don’t understand what the heck is going on with browser usage statistics. I honestly have no clue which browser has the most market share. If you look a different sources you get radically different numbers.

	W3 Schools	Net Market Share	Wikipedia	Global Stats counter	W3 Counter
Internet Explorer	11.80%	57.79%	20.47%	28.56%	23.90%
Firefox	28.20%	18.58%	17.71%	18.36%	17.80%
Chrome	52.90%	15.98%	46.02%	40.80%	31.60%
Safari	3.90%	5.77%	3.10%	8.52%	14.20%
Opera	1.80%	1.47%	5.45%	1.16%	2.40%

That’s hard to understand so let’s throw out some graphs. The one which caught my eye right away was Net Market Share. They show an overwhelming lead for Internet Explorer

On the other hand everybody else shows a lead, of various degrees, for Chrome

The divergence is because of the different methods used to get the data. For instance Wikipedia and W3 schools look only at the statistics on their site. Because both of them are used by people with a fair degree of technical ability they reflect a higher degree of usage by Chrome. The interesting ones are the first three, W3 counter, Global Stats Counter and Net Market Share. They are all aggregators of a large number of sites. I’m socked to see such a high degree of variability. Each of these sources use millions of page views to gather their information so a variation of more than a couple of points seems unusual.

It feels like the take-away here is that the browser usage statistics are garbage. As an industry we’re totally failing to measure the most basic of statistics about how people interact with the Internet as a whole. We should be ashamed of ourselves and we should do something about it. In the meantime it seems like we’re going to have to continue to support at least 3 possibly 4 different browsers to say nothing of the various versions of the browsers. The trends don’t reveal anything of any use either. There seems to be some momentum behind Chrome and IE and less for Firefox, Opera and Safari ““ but who really knows. The Internet is not homogeneous so we see different browser statistics when we slice our data geographically and topically. I bet the usage statistics on hacker news are interesting.

Unfortunately all of this means that you’re going to have to look at the statistics on your own website to see which browsers should be concentrated upon. I hate supporting old browsers but if your market is 6 guys sitting in their Unix holes* using Lynx then you’re supporting Lynx. Best of luck to you!

*Unix hole ““ it is a thing, trust me.

2013-09-25

What'a a good metric for programming language usage?

The whole “which programming language is most popular” debate was kicked off in my mind today by a tweet from @kellabyte. She tweeted

“X is dead” usually derived from small samples of our industry. http://t.co/0XFN0RQBrI greater growth than JS in last 12mo. Think about that

“” Kelly Sommers (@kellabyte) September 25, 2013

I was outraged that a well respected blogger/tweeter such as kellabyte would tweet horrific lies of this sort. “This is exactly”, I thought, “the problem with our industry ““ too many people corrupted by fame and supporting their own visual basic.net related agendas.” Of course I was wrong: kellabyte has no interested in VB and her numbers were not wrong.

I have always relied on TIOBE’s measurement of programming language popularityto give me an idea of what the top languages are. I think this is likely kellabyte’s source also. The methodology used is quite extensively outlined athttp://www.tiobe.com/index.php/content/paperinfo/tpci/tpci_definition.htm. If you don’t fancy reading all that the gist is that they use a series of search engines and count the number of results. The ebb and flow of these numbers is what makes up the rankings.

Obviously there are a number of flaws in this methodology:

The algorithms used by the search engines are not static
Not all programming languages are equally likely to be written about
Languages and technologies are often conflated

Let’s look at each one of those. The search engine market is a constantly changing landscape. Google and Bing are always working towards improving ranking and how results are reported. There is going to be some necessary churn around ranking changes. TIOBE average out a number of search engines in the hopes they can normalize that problem. They use 23 different search engines which is a good number but many of them are very specialized search engines such as Deviant Art. Certain search engines are also given higher ranking for instance Google gives 28% of the final score. In fact the top 3 search engines account for 69% of the score. I’m no statistician but that doesn’t seem like a good distribution. Interestingly 4 out of the top 5 sources are Google properties with the 5th(wikipedia) being heavily sponsored by Google.

The second point is that programming langues are not all equally likely to be written about. My feeling is that newer languages and “cooler” languages will gain an unfair advantage here. People are much more likely to be blogging about them than something boring like VBA. I would say half the code I’ve written in the last 6 months has been VBA but I don’t believe I have more than 2 blog posts on that topic.

I’m guilty of this: when I talk about .net in most cases I’m really talking about C#. Equally when people talk about Rails they’re talking about ruby. I’m not convinced that this information is well captured in TIOBE. It is a difficult problem because a search for “rails” is likely to return far more hits than just those related to programming. Context is important and without some natural language processing capabilities I don’t see how TIOBE can be accurate.

The alternatives to TIOBE are not particularly promising. James McKaysuggested that looking at job posting and github project would be a better metric. He specifically mentioned the job aggregatorhttp://www.itjobswatch.co.uk/. I’ve been thinking about this and it seems like a pretty good metric. The majority of development is likely done inside companies so looking at a job site gives a window into the inner workings of companies. Where it falls down is in looking at companies which are too small to post jobs and open source software. The counter balance to that is found in github statistics. These statistics are likely to have the opposite bias favoring upstart languages and open source contributions. I think we’re at the point where if you’re running an open source project you’re running it on github which makes it an invaluable source of data.

To the mix I would add stackoverflow as a source of numbers. They are a large enough question and answer site now that they’re a great source of data. I’m not sure what the biases would be there ““ C# perhaps?

Combining these statistics would be an interesting exercise ““ perhaps one for a quickly approaching winter’s day.

2013-09-18

Document control and DDD/CQRS - solving similar problems

I had the good fortune to have a two hour introduction to the world of document control the other day. It was refreshing to see that we programmers aren’t the only ones who don’t have things figured out yet. The entire document control process is an exercise in managing the flow and ownership of data. I spent a lot of time thinking about how similar the document control problem and the data flow problem mirror each other.

Document control is really interested in documents and doesn’t care at all about the contents of these documents. Their concerns are largely around

who owns this document?
what is the latest version of this document?
how is this document identified?
how long do I have to keep this document?
is this document superseded by some other document?

These sound a lot like issues with which we deal when using DDD. Document ownership is a simply a problem of knowing in which aggregate root a document belongs. Document versioning is similar to maintaining an event stream. Document identification is typically done through numbering ““ however the flow of documents is slow enough that sequential numbering isn’t a problem ““ no need for a randomly generated GUID.

Document retention isn’t one with which we typically spend much time in CQRS land. Storage is cheap so we just keep every version around or at least we’re able to generate every version through event sourcing. Perhaps the most congruent concept is taking snapshots of aggregates, but we’re typically only interested in the most recent version of the aggregate. With document control there is always some degree of manual intervention with documents so there is a significant cost to retaining all documents indefinitely. I’m only talking about digital copies of document here, Zuul protect you if you need to track paper copies of things too. I can’t even keep track of my keys let alone tens of thousands of documents. My strategy for paper documents would be to burn them as soon as I got them and refer people to the digital version.

Superseding documents also doesn’t seem like a problem we typically have in CQRS. In document control one or more documents may be supersededby one or more documents. For instance we may have a lot of temporary documents which are created by the business things like requests to move offices. They have value but only in a transitory way. Every week the new office seating chart is built from these office move documents and the documents discarded. Their purpose is complete and we no longer care about them as we have a summary document.

Many documents become one. I call it a Voltron operation.

In the opposite operation a document can be replaced by a series of documents. This activity is prevalent when adding detail to documents. A single data sheet may become several documents when examined in more detail

Reverse Voltron? Fan-out? The name may need some work.

This was originally going to be a post about how much we in the DDD/CQRS community have to learn from document control. I imagined that document control was a pretty old and well defined problem. There would surely be well defined solutions. I did not get that impression.

The problem of canonical source of truth or “who owns the data” is a very difficult one in document control. We’re spoiled in DDD because it is rare indeed that the owner of a piece of data can change during its lifespan. Typically the data would remain within an AR and never updated without the involvement of the AR. With document control it is probable that responsibility could jump from your AR to some other, possibly unknown, AR. It could then jump back. At any point in time it would be impossible, without querying every AR, who had control of the data. Of course with a distributed system like many people working on a document it is possible that there will be disagreement about which AR has responsibility at any one time. Yikes!

What we can learn from document control

I think that looking at document control gives us a window into what can happen when you relax some of the constraints around DDD. Data life-cycle is well defined in DDD and we know who owns data. If you don’t then you end up in trouble with knowing who is the source of truth. Document control must solve this problem constantly and it can only be done by going out and asking stakeholders a lot of questions ““ a time consuming exercise.

The introduction of splitting and combining documents, or in our case aggregates, over their lifetime is disastrous. You lose out on the history of information and knowing where to apply events becomes difficult. Instead we should retain aggregates as unchanged as possible (in terms of what fields they have, obviously the data can change) and rely on projections of the data to create different views of information. This is basically impossible to apply to formatted documents as you would have in document control.

What I think would help out document control

The first thing which comes to mind as being directly applicable to document control is removing meaning from the document identifiers. The documents document control manages tend to be numbered and the temptation to add meaning to a document number is too tempting to turn down. For instance you might get a number like

P334E-TT-6554

In our imaginary scheme all documents which start with P are piping diagrams. The 334 denotes the system to which it belong, E the operating pressure and TT the substance inside the pipe. The final digits are just incrementally assigned. The problem is just what you would expect: things change. When they do a decision must be made to either leave the number intact and damage its reliability or to renumber the document and lose the history. instead document control would do well to maintain an identifier whose sole purpose is to identify the document. The number can be retained but only as a field.

A more controversial assertion is that document control should retain all documentation. We retain a full history of messages used to build an entity, even if it is offline and used in favor of a snapshot. I believe that document control should do the same thing. Merging and splitting document is problematic and complicated. It is easier to just create a new document and reference the source documents. Ideally the generation of these new documents can be treated as a projection and the original documents retained.

In the end it is interesting to see how similar problem domains are solved by different people. That’s the beauty in learning a new development language; every language has different features and practices. I’m not, however, prepared to be the guy who learns document control in depth to bring their knowledge back to the community.

2013-09-11

The city of calgary doesn't get open data

It seems that the City of Calgary has updated its open data portal. I was alerted to it not by some sort of announcement but by a tweet from Grant Neufeld who isn’t a city employee any shouldn’t be my source of information on open data in Calgary.

Nice new City of Calgary Open Data website was quietly deployed a couple weeks ago! https://t.co/y1PnIuJg0o #yycdata #yyccc

“” Grant Neufeld (@grant) September 10, 2013

The new site is better than the old one. They have done away with the concept of having to add data to a shopping card and then check out with it. They have also made the data sets more obvious by putting them all in one table. They have also opened up an app showcasewhich is a fantastic feature. It can’t help to cross promote apps which make use of your data. There are also a few links to Google and Bing maps which do an integration with the city’s provided KML files. As I’ve said before I’m not a GIS guy so most of that is way over my head.

It is a big step forward”¦ well it is a step forward. I know the city is busy with more important things than open data but the improvements to the site are a couple of day’s worth of work at best. What frustrates me about the process is that despite having several years on lead time on this stuff the city is still not sure about what open data is. I draw your attention to the FoIP requests CSV&VariantId=1(CITYonlineDefault)). First thing you’ll notice is that despite being listed as a CSV it isn’t, it is an Excel document. Second is that the format is totally not machine readable, at least not without some painful parsing of different rows. Third the data is a summary and not the far more useful raw data. I bet there is some supposed reason that they can’t release detailed information. However if FoIP requests aren’t public knowledge then I don’t know what would be.

Open data is not that difficult. I’ve reproduced here the 8 principles of open data fromhttp://www.opengovdata.org/home/8principles

Data Must Be Complete

All public data are made available. Data are electronically stored information or recordings, including but not limited to documents, databases, transcripts, and audio/visual recordings. Public data are data that are not subject to valid privacy, security or privilege limitations, as governed by other statutes.

2. Data Must Be Primary

Data are published as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.

3. Data Must Be Timely

Data are made available as quickly as necessary to preserve the value of the data.

4. Data Must Be Accessible

Data are available to the widest range of users for the widest range of purposes.

5. Data Must Be Machine processable

Data are reasonably structured to allow automated processing of it.

6. Access Must Be Non-Discriminatory

Data are available to anyone, with no requirement of registration.

7. Data Formats Must Be Non-Proprietary

Data are available in a format over which no entity has exclusive control.

8. Data Must Be License-free

Data are not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed as governed by other statutes.

The city is failing to meet a number of these. They are so simple, I just don’t get what they’re missing. The city employees aren’t stupid so all I can conclude is that there is either a great deal of resistance to open data somewhere in the government or nobody is really convinced of the value of it yet. In either case we need a good push from the top to get going.

Archives

A blog about computer programming and technology.

My Books