
 Hash/Encrypt CSV to lookup repeating data?


gllen
Starting Member

7 Posts

Posted - 2007-10-01 : 22:51:32
Hopefully this makes sense; I'm not sure what to even begin researching...

I'm trying to optimize all facets of this process, as it will take over the resources on my server if not done efficiently.

I have CSV files containing INTs that I need to upsert (match to an existing/earlier imported array, or create a new record set) millions of times a day. To be clear, this data is a small subset of the actual import; this array's contents are not the main data of the process, and the value of the entire array is meant to be related to higher-level tables.

The contents of the CSV array are 99.9+% repeating, meaning they will very often share the exact same contents as a previously imported array. A rough guess is that there are 20k combinations existing, with less than 1k new per month, and each array will range from 6 cols x 15 rows to 6 cols x 50 rows.

So the current plan is to use an MD5 hash during the (not SQL related) export process to identify the contents of this CSV file, and export only the MD5 (32-digit hex) as a lookup to identify the contents. If the SQL import process finds a new (unknown) MD5 it will request the actual contents; otherwise it will simply use the MD5 as a key/id/code for the array contents that are already stored.
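
Roughly what I'm picturing on the SQL side, as a sketch only (the table and column names here are placeholders; HASHBYTES is there in SQL 2005 if the hash ever needs to be recomputed server-side, though its input is limited to 8000 bytes):

-- Lookup table: one row per distinct array, keyed by its MD5
CREATE TABLE dbo.ArrayLookup (
    ArrayMD5  char(32)     NOT NULL PRIMARY KEY,  -- 32-hex-digit MD5 from the export
    ArrayCsv  varchar(max) NOT NULL               -- raw CSV contents, stored only once
)

-- On import: if the MD5 is unknown, request and insert the contents; otherwise just reuse the key
IF NOT EXISTS (SELECT 1 FROM dbo.ArrayLookup WHERE ArrayMD5 = @ArrayMD5)
    INSERT dbo.ArrayLookup (ArrayMD5, ArrayCsv) VALUES (@ArrayMD5, @ArrayCsv)

-- For reference: SELECT HASHBYTES('MD5', @ArrayCsv) computes MD5 inside the database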

There's probably a certain terminology I'm not familiar with for this type of thing... I've never heard of something like this. I realize collision is a threat, but I'm unsure how much I should be worried about it with this type of data (similar size/contents, but a relatively small number of possibilities). I think even a 0.1% collision rate would be acceptable, which is probably way more tolerance than I'd ever need.
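
(If I'm doing the birthday math right: with n distinct arrays and a 128-bit hash, the chance of ANY collision is roughly n^2 / 2^129. Even at a million distinct arrays that's about 10^12 / 6.8x10^38, i.e. on the order of 1 in 10^27, which is astronomically below my 0.1% threshold, assuming nobody is deliberately crafting colliding inputs.)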


Does this sound like a bad idea to anyone? Are there certain hash functions I should use for this type of thing? Anyone have suggestions of where to look next?

Thanks!

Kristen
Test

22859 Posts

Posted - 2007-10-02 : 04:31:45
We use an "integration" server for this type of activity.

We "pull" the data from the source (CSV, remote Oracle database , whatever!) into #TEMP table

Then we "Upsert" the Staging tables in the Integration database.

The Staging tables hold ONLY the columns WE need (i.e. the Import data may have more columns than we actually store).

The Staging tables also have an Update Date column. The Upsert sets that date.
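
On SQL 2005 that Upsert is just an UPDATE followed by an INSERT (no MERGE); the shape is roughly this, with made-up table/column names and the NULL-safe comparisons left out for brevity:

-- Update rows that already exist in Staging and have actually changed,
-- stamping the Update Date as we go
UPDATE S
SET    S.ColA = T.ColA,
       S.ColB = T.ColB,
       S.UpdateDate = GETDATE()
FROM   dbo.MyStaging S
       JOIN #RawImport T ON T.KeyCol = S.KeyCol
WHERE  S.ColA <> T.ColA OR S.ColB <> T.ColB

-- Insert rows that are not in Staging at all
INSERT dbo.MyStaging (KeyCol, ColA, ColB, UpdateDate)
SELECT T.KeyCol, T.ColA, T.ColB, GETDATE()
FROM   #RawImport T
WHERE  NOT EXISTS (SELECT * FROM dbo.MyStaging S WHERE S.KeyCol = T.KeyCol)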

We then Upsert into the production database, but only records where the UpdateDate is more recent than last time (we store the Max(UpdateDate) value AFTER a successful Upsert of production).
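
Something along these lines (again made-up names; a watermark table is just one way of remembering the "last time" value):

DECLARE @LastRun datetime
SELECT @LastRun = LastUpdateDate
FROM   dbo.UpsertWatermark
WHERE  TableName = 'MyStaging'

-- Only ship rows that changed since the last successful run
UPDATE P
SET    P.ColA = S.ColA, P.ColB = S.ColB
FROM   ProductionDB.dbo.MyTable P
       JOIN dbo.MyStaging S ON S.KeyCol = P.KeyCol
WHERE  S.UpdateDate > @LastRun

INSERT ProductionDB.dbo.MyTable (KeyCol, ColA, ColB)
SELECT S.KeyCol, S.ColA, S.ColB
FROM   dbo.MyStaging S
WHERE  S.UpdateDate > @LastRun
  AND  NOT EXISTS (SELECT * FROM ProductionDB.dbo.MyTable P WHERE P.KeyCol = S.KeyCol)

-- After a successful run, store the new high-water mark
UPDATE dbo.UpsertWatermark
SET    LastUpdateDate = (SELECT MAX(UpdateDate) FROM dbo.MyStaging)
WHERE  TableName = 'MyStaging'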

Typically the Integration server is "close" (in network/bandwidth terms) to the Source data, so only minimal "really-changed" data needs to be shipped over the longer, slower, bandwidth to the remote server.

And in our context "really-changed" means in columns that WE are interested in, therefore many rows that have physically changed may not be regarded as a change at all to us.

Kristen