The Problem
My current project involves Azure’s Data Lake Store and Analytics. We’re using the SSIS Azure Feature Pack‘s Azure Data Lake Store Destination to move data from our clients on premise system into the Lake, then using U-SQL to generate a delta file which goes on to be loaded into the warehouse. U-SQL is a “schema-on-read” language, which means you need a consistent and predictable format to be able to define the schema as you pull data out.
We ran in to an issue with this schema-on-read approach, but once you understand the issue, it’s simple to rectify. The Data Lake Store Destination task does not use the same column ordering which is shown in the destination mapping. Instead, it appears to rely on an underlying column identifier. This means that if you apply any conversions to a column in the data flow, this column will automatically be placed at the end of file- taking away the predictability of the file format, and potentially making your schema inconsistent if you have historic data in the Lake.
An Example
Create a simple package which pulls data from a flat file and moves it into the Lake.
Mappings of the Destination are as follows:
Running the package, and viewing the file in the Lake gives us the following (as we’d expect, based on the mappings):
Now add a conversion task – the one in my package just converts Col2 to a DT_I4, update the mappings in the destination, and run the package.
Open the file up in the Lake again, and you’ll find that Col2 is now at the end and contains the name of the input column, not the destination column:
The Fix
As mention in my “The Problem” section, the fix is extremely simple – just handle it in your U-SQL by re-ordering the columns appropriately during extraction! This article is more about giving a heads up and highlighting the problem, than a mind-blowing solution.