Tuesday, March 6, 2012

Adding new column to Flat File Source Connection

What is the best way to deal with a flat file source when you need to add a new column? This happens constantly in our Data Warehouse: another field gets added to one of the files to be imported because users want more data items. When I originally set the file up in Connection Managers, I used Suggest File Types and then made many adjustments to data types and lengths on the Advanced Tab, because Suggest File Types goofs a lot even if you tell it to use 1000 rows. I have been using the Advanced Tab revisions to minimize the Derived Column entries. The file imports nightly. Now I have new fields added to this file, and when I open the Connection Manager for the file, it does not recognize the new columns in the file unless I click Reset Fields. If I click Reset Fields, it wipes out all the Advanced Tab revisions! If I don't click Reset Fields, it doesn't seem to recognize that the new fields are in the file.

Is it a waste of time to make Advanced Tab type and length changes? Is it a better strategy to just use Suggest Types, and not change anything, and take whatever you get and set up more Derived Column entries? How did the designers intend for file changes to be handled?

Or is there an easy way to add new fields to this import that I am overlooking? I am finding it MUCH more laborious to set up or modify a file load in SSIS than in DTS. In DTS, I just Edit the transformation, add the field to the Source and Destination lists, and I'm good to go. My boss doesn't understand why a "better" version is taking so much more work!

thanks,

Holly

Ah, well, I have some sympathy for you, but as soon as you said Data Warehouse, I lost that sympathy. The flat file source CANNOT change. SSIS isn't meant to be a be-all/do-all application. There are many downstream metadata components that would surely break if the flat file source could somehow magically understand the new columns. The whole point of SSIS (and many die-hard DTS people will balk at this) is to enforce strict adherence to the way connections are built. Any changes require developer effort to ensure that the changes are handled appropriately. (No guessing!)

Suggest data types is good for a starting point, but you'll still need to go through and make sure that they are correct. The suggest feature does not scan all rows.

The more important question is why you (the maintainers of the Data Warehouse) are allowing a constantly changing file to be introduced into the system. There should be some sort of production control in place.
|||

Exactly for the reasons Phil says, I don't use the "flat file" connection in SSIS for imports unless it is really, really needed.

If it is a standard delimited file, I use a temporary table with the same field structure as the source file and the field names of the destination. I use Bulk Insert or BCP to insert into the temp table, then use SSIS to transform/move the data to the ultimate destination. In most cases I use
INSERT INTO dest SELECT * FROM temptable. In other cases, I have to write a dynamic SQL statement that builds the list of matching field names between the two tables and inserts them.

That way if a new field is added, all you have to do is change the temp table and the dest table. Nothing in SSIS needs to change.
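
To make this concrete, here is a rough T-SQL sketch of the pattern; the table names, file path, and delimiter are made up for the example:

-- Staging table mirrors the layout of the incoming file, using the destination's field names.
CREATE TABLE stage.CustomerExtract (
    CustomerID   varchar(20),
    CustomerName varchar(100),
    Region       varchar(50)
);

-- Load the delimited file as-is; no transformation logic at this step.
BULK INSERT stage.CustomerExtract
FROM '\\fileserver\extracts\CustomerExtract.txt'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

-- Push the staged rows to the destination. When a new field is added,
-- only the two table definitions change; these statements stay the same.
INSERT INTO dbo.Customer
SELECT * FROM stage.CustomerExtract;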

Yes, I know it is slower. But the ease of changing far outweighs the added time it takes.
|||

Sorry, I obviously did not explain clearly. The files are not "constantly changing."

Business users submit requests for data warehouse additions. Our data warehouse has expanded constantly over the years: new tables, new fields in existing tables, etc. When these requests are approved, the changes have to be implemented. If the data comes from the mainframe, and if it is logically part of an existing download, the programmer adds the new field(s) to the appropriate download that creates the extract file. Then the data warehouse administrator has to modify the package to import the new data fields. As the business changes, with new regulations and new situations, new data is needed.

I don't mean that one night, suddenly, with no warning, there are new fields in a file. But I'd love to have a dollar for every field I have added to an existing DTS import over the years.

We use DTS for our ETL. I am now converting it to SSIS, but only in the Test environment. So far I have mostly been creating the SSIS packages, but since that takes a while and other work goes on, I am coming to situations where I have to modify existing SSIS packages that I have already converted, because I have changed the DTS packages they were converted from.

I am beginning to think that changing types and lengths on the Advanced Tab was the wrong strategy, because I can't see any way of adding another field or two without wiping out all that work. That was the purpose of my question: what are others in this situation using as their approach?

Thank you for your reply and I would be glad to hear other opinions.

Holly

|||

Tom,

Yes, I can see your point. The ideal thing would be to let the import file define the structure of that temporary or prep or stage table, wherever you want to dump the data. In most cases I am importing from the text file into a "prep" table. However, the data structure of the prep tables was defined in 1999 and 2000, the stored procedures that load from them expect that structure, and there are hundreds of tables, so I really don't have time to change them and then change the stored procs to deal with the new situation... It seems to be an example of how a new ETL set up in SSIS would be done very differently from one set up under older versions. The conversion requires fixing something, and it's sort of a question of what you decide to fix.

It is interesting to hear the different ways people do things.

Holly

|||

I think a general "best practice" kind of approach to loading files is to always load them to a staging table. The goal of this initial load is that it should contain virtually no logic - it should be so simple that it virtually never fails.

A more drastic way of doing this is to use a "generic loader" setup. This consists of one staging table with enough varchar columns (max size) to cover the number of columns in your widest table, plus a "target_table_name" column. You can create a generic file import process that loads every record in every file to this structure, setting the target table column differently for each load. You will have to maintain a mapping of your generic column names to the specific column names of each target table, or you can double up and add a "column name" column alongside each data column.
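
As a rough illustration, the generic staging table might look something like this (the column count and names here are invented):

-- One wide staging table shared by every file load.
-- Col001..Col050 are sized to cover the widest target table.
CREATE TABLE stage.GenericLoad (
    target_table_name varchar(128) NOT NULL,  -- which target table this row belongs to
    Col001 varchar(max) NULL,
    Col002 varchar(max) NULL,
    Col003 varchar(max) NULL,
    -- ... repeated out to the widest table you need to cover
    Col050 varchar(max) NULL
);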

You can change datatypes, etc. when loading the targets from the generic table. This makes it a bit easier to recover from errors due to datatype issues, which is especially common when loading user-maintained Excel files, for example.
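
Loading one target from the generic table is then a filtered insert with explicit conversions, along these lines (the target table and column mapping here are hypothetical):

-- Cast the varchar staging values to the target's real data types.
-- The generic-to-specific column mapping has to be maintained somewhere.
INSERT INTO dbo.Orders (OrderID, OrderDate, Amount)
SELECT CAST(Col001 AS int),
       CAST(Col002 AS datetime),
       CAST(Col003 AS decimal(12, 2))
FROM stage.GenericLoad
WHERE target_table_name = 'Orders';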

Obviously there are some drawbacks to this approach, like the mapping confusion it can create, and the fact that your loads tend to become pretty linear in execution. If you try to parallel load a setup like this, the table gets thrashed too much and performance suffers. In exchange for those tradeoffs you get a lot more flexibility and fewer ETL objects to maintain, though the maintenance may be more difficult in some cases.

There are always tradeoffs, but I thought I'd throw the idea out in case it might fit your situation.

|||

Holly,

I know it is a bunch of work to reimplement; I too have hundreds of flat files to import. I started this approach a long time ago because I could see this was going to be a problem.

I shortened part of my process in my description. I actually have a stored proc that imports the flat files into a temp (staging) database where the table names match the flat file names. I loop over the tables, import each flat file (tablename.txt) into its staging table, and then process the tables from the "import" database into their ultimate destination.
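
A bare-bones sketch of that kind of loop, with placeholder database names and file paths, might look like this:

-- Walk the staging tables and bulk insert the matching .txt file into each one.
DECLARE @table sysname, @sql nvarchar(max);

DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT TABLE_NAME
    FROM ImportDB.INFORMATION_SCHEMA.TABLES
    WHERE TABLE_TYPE = 'BASE TABLE';

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @table;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- tablename.txt is assumed to sit in a fixed extract folder.
    SET @sql = N'BULK INSERT ImportDB.dbo.' + QUOTENAME(@table) +
               N' FROM ''\\fileserver\extracts\' + @table + N'.txt''' +
               N' WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'');';
    EXEC (@sql);

    FETCH NEXT FROM table_cursor INTO @table;
END

CLOSE table_cursor;
DEALLOCATE table_cursor;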

In your case, if you take my approach you could move them into the stage tables in a stage database, and then insert into your existing "prep" tables for the rest of the processing by the stored proc.

Neither SSIS nor DTS is pretty when it comes to changing file structures for importing, especially when all you usually need to do is map txtfile.field123 to table.field123.

Good luck
|||

You may try DataDefractor. It handles input data changes gracefully.

Regards,
Ivan
