Copyright 2017-2018 Jason Ross, All Rights Reserved

A CN Rail train crossing 34 Ave SE in Calgary. Although there's a level crossing there and a siding, it's not often that you see the train crossing the road. It's still best to assume it might be there though.

Nobody likes work-related phone calls in the early hours of the morning; they have a nasty tendency to ruin your day, if not your weekend, and even worse they’re usually avoidable. The most avoidable of all are those caused by import data being, for want of a better word, garbage.

In the early days of a system things might work fairly well: the system was developed to handle the data formats described in the “Data Transfer Specification”. It doesn’t take long, though, before things start to go downhill. Compare the data you’re receiving to the specification and you’ll often discover that the two are drifting further and further apart. What should you do?

To be fair, very few specifications are written with the specific intent to deceive. They’re usually written by people who are doing their best to describe the data being made available to client systems at that time. The problem is the same as with any other reviewed and controlled document: keeping it up to date with the actual data format takes a lot of effort, so nobody wants to do it. There’s also the effort of communicating changes to clients, and that news never seems to reach the people who need it.

From the client side this appears as a data format which is well specified to start with but becomes more vague and inaccurate with time. The names and types of fields change arbitrarily, date formats switch from ISO to country-specific, fields move or disappear, new fields appear and eventually someone excels themselves and sends out the data in a ZIP file. Even if the data does match the specification, you’ll find that it starts to include edge cases which were never tested during development. Couple all of this with the fact that most data processing is done overnight, and you can almost guarantee that someone, possibly you, will be the lucky person getting a phone call in the early hours of the morning. Probably on a Sunday.

There are two ways to fix this. One option you could try is talking to the supplier of the data and asking them to correct it. Bear in mind the data probably comes from a third party and you’ll doubtless be told that they’d warned someone of the change in format, or they’ll apologize but leave the data as it is, or they’ll just ignore you. In any case this won’t be any help to you in the early hours of the morning or at the weekend, when you really need it, and then when the format inevitably changes again you’ll be back in exactly the same position.

Obviously if it turns out that the errors were caused by a misinterpretation of the specification then this option isn’t open to you anyway.

The second option is to assume that anything sent to your system is in some way wrong and to handle it accordingly.

Before you get too deep into solving the problem this way, it’s best to clarify what should happen when you get data that is in the wrong format, and indeed what exactly IS the wrong format. For example, if the order of the fields in a CSV file changes, or extra fields are added, that probably isn’t a problem. If the field names change that could be a problem, but if it’s just a change of case then it might be acceptable. With an XML file things are a little different: it might seem that simply validating against the schema is enough, and maybe it is. But maybe extra data that’s not mentioned in the current schema can be ignored. Missing data that the schema says is required, on the other hand, is probably enough to make the file invalid. But then what if that only occurs on a few nodes? Should you just reject those nodes, or the whole file?
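As a sketch of what tolerating harmless changes might look like, this checks a CSV file’s headers case-insensitively, ignores field order and extra columns, and only complains when a required column is genuinely missing. The column names and sample data here are purely illustrative, not from any real specification.

```python
import csv
import io

# Hypothetical required columns for an import feed.
REQUIRED_COLUMNS = {"customer_id", "order_date", "amount"}

def check_headers(reader: csv.DictReader) -> dict:
    """Match required columns case-insensitively, ignoring field order
    and extra columns. Returns a map from required name to the actual
    header used in the file, or raises ValueError if any are missing."""
    actual = {name.strip().lower(): name for name in reader.fieldnames or []}
    mapping = {}
    missing = []
    for required in REQUIRED_COLUMNS:
        if required in actual:
            mapping[required] = actual[required]
        else:
            missing.append(required)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return mapping

# A file whose header case and column order differ from the "specification".
sample = "Customer_ID,Order_Date,Amount,Extra\n42,2018-01-31,19.99,x\n"
reader = csv.DictReader(io.StringIO(sample))
mapping = check_headers(reader)
```

The returned mapping lets the rest of the import code keep using the specified names even when the supplier renames headers by case alone.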

Finally, you need to decide what should happen when a file is rejected: is there anything you can actually do if someone calls you outside work hours? If not, what’s the point of the call?

Once you’ve worked all this out and your system has decided that the file looks valid, you need to think about defensive techniques to handle the data. If you’re at the start of the project that makes things easier: your techniques will be built in from the start and you don’t have to worry about changing components. Otherwise you’re probably going to have to do some refactoring. The main thing to do is to ensure you have validation code, and that it doesn’t overreact to invalid data.

Many import systems don’t have any validation code, but since that’s all that stands between your system and invalid data, or maybe even malware, yours shouldn’t be one of them. It’s not just a security risk; there’s also the chance that the system will take totally invalid data and try to send it to a database, either lowering the overall quality of the data or causing an unexpected exception to be thrown.
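A minimal sketch of that kind of validation, assuming the same hypothetical fields as before: each row is checked and normalized before it gets anywhere near a database, with an ISO-only date check so country-specific formats are caught early rather than silently misinterpreted.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def validate_row(row: dict) -> dict:
    """Validate and normalize one imported row. Raises ValueError with a
    descriptive message instead of letting bad data reach the database.
    The field names are illustrative, not from any real specification."""
    errors = []
    cleaned = {}

    customer_id = (row.get("customer_id") or "").strip()
    if customer_id.isdigit():
        cleaned["customer_id"] = int(customer_id)
    else:
        errors.append(f"customer_id {customer_id!r} is not numeric")

    try:
        # Insist on ISO dates; "31/01/2018" should fail here, not downstream.
        cleaned["order_date"] = datetime.strptime(
            (row.get("order_date") or "").strip(), "%Y-%m-%d").date()
    except ValueError:
        errors.append(f"order_date {row.get('order_date')!r} is not ISO (YYYY-MM-DD)")

    try:
        cleaned["amount"] = Decimal((row.get("amount") or "").strip())
    except InvalidOperation:
        errors.append(f"amount {row.get('amount')!r} is not a number")

    if errors:
        raise ValueError("; ".join(errors))
    return cleaned
```

Collecting all of a row’s problems into one message, rather than failing on the first, gives your data provider a complete picture of what’s wrong with each record.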

Many import routines throw an exception if they encounter an error in a row of data. Most of the time this is enough to terminate the process and prevent the import of the rest of the data. This is not what you need, though; instead the code should detect and log the error, then skip the current row without importing it.
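The log-and-skip approach can be sketched like this: one bad row is logged with its line number and skipped, and the rest of the file still imports. The trivial validator and sample data are assumptions purely for illustration.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("import")

def import_rows(reader, validate, store):
    """Validate each row; log and skip failures instead of aborting, so
    one bad row cannot block the whole file. Returns (imported, rejected)."""
    imported = rejected = 0
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            store(validate(row))
            imported += 1
        except ValueError as error:
            log.warning("Skipping row at line %d: %s", line_number, error)
            rejected += 1
    return imported, rejected

# Illustrative validator: an 'amount' column that must be a whole number.
def validate(row):
    value = (row.get("amount") or "").strip()
    if not value.isdigit():
        raise ValueError(f"amount {value!r} is not a number")
    return {"amount": int(value)}

sample = "amount\n10\nbad\n20\n"
stored = []
reader = csv.DictReader(io.StringIO(sample))
counts = import_rows(reader, validate, stored.append)  # two rows kept, one skipped
```

The counts give you exactly the morning-after summary the next paragraph describes: what was imported, and what was rejected and why.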

Taking this approach the system will import all of the valid data it finds, and record the details of everything it rejected. At the start of the next work day you can see what has been imported, examine the data that failed, and adjust your import process as required. If the files contain invalid data, then you have more details to pass back to your data provider to help them debug their systems, and to stop them sending invalid data to you. This means your company’s systems will be more reliable, your data quality will be much higher, and you should be much less likely to get a call when you’re out of the office, and that can’t be bad.