
 blob vs separate lines


pille
Starting Member

15 Posts

Posted - 2011-02-23 : 11:19:46
hello everybody
i need to write a script to import txt files into a database.
it's a huge number of files and every file contains thousands of entries. i am talking about 2 years of apache logfiles. we need to store them 100% reconstructable.
we decided to put everything into an sql database (almost certainly 2008 enterprise).
now the question is: which is the better strategy?
1. read each whole file in with one statement (blob), or read it into the database line by line?
2. if i need to search for something or recover a time period, which is the best way (i guess the line-by-line strategy)?
3. which needs less space?

high performance is not the priority. if the import takes 10 minutes longer with one method than with the other, it doesn't matter.

important needs
1. fast recovery of logs for a given time period
2. less space on the hdd (at the moment we have 1 tb of logs)
3. fast searching for any entry

thank you guys for the advice
pille
PS: i can write a simple script but i am not an expert in ms sql

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-02-23 : 12:29:14
All of the needs you listed argue against using BLOBs, because they cannot be indexed the way you'd like (full-text indexing is possible but won't handle dates or numeric values). There's no good way to get high performance from BLOB columns.

If you're using SQL 2008 Enterprise you have the option of data compression, except it's not available for BLOB columns. Normalizing the data into multiple rows would allow you to take advantage of it.
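For reference, here's a minimal sketch of turning page compression on for an existing table (the name dbo.Logs is just a placeholder):

ALTER TABLE dbo.Logs REBUILD WITH (DATA_COMPRESSION = PAGE)
-- nonclustered indexes are compressed separately
ALTER INDEX ALL ON dbo.Logs REBUILD WITH (DATA_COMPRESSION = PAGE)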

It's hard to say how to normalize these logs without seeing some examples. Regarding the "100% reconstructable" requirement, I'd suggest just zipping/archiving the log files after you've imported the data into the database.

pille
Starting Member

15 Posts

Posted - 2011-02-24 : 07:03:10
thank you for your answer. i was already thinking along the same lines.

here is an example of a log entry (short version - it's from my test server)
2011-02-24 01:46:44 172.16.10.10 - www.domain.com GET /index.php detailed_link_without_domainname 200 - - Mozilla/4.0+(client browser information) - complete_link

i think the best move would be to split it up:
@date: 2011-02-24
@time: 01:46:44
@ipaddress: 172.16.10.10
@domain: www.domain.com
@get: get
@mainsite: /index.php
@option: detailed_link_without_domainname
- here there can be a lot of different options in the link
@httpcode: 200
@browserinfo: Mozilla/4.0+(client browser information)
@completelink: complete_link

with the zipping/archiving option we cannot achieve our goals. backups are usually kept for 3 months, but we need to archive 2 years of logs. a database is easy to back up, and restoring entries from the db is easy too. i also expect a much smaller volume with a database than with files (we already have 1 tb just in the archive folder). the db could grow by another 700 gb every year (based on a trend analysis).
what do you think?

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-02-24 : 08:17:09
The layout sounds good, but I'm not sure you'll save disk space storing it that way, even with data compression. To get the best effect there has to be a lot of duplicate data in sequential rows, i.e. 20 entries of the same domain name, same browser info, etc. will compress nicely. If they differ then it won't compress as much.

You can save space using date/time/datetime data types vs. storing them as varchar, plus they will index properly. But the index(es) will add additional space, and the internal storage of SQL Server has some overhead and padding that a plain text file won't.
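Just as an illustration of the type savings (the table and column names here are made up), splitting the timestamp the way you described could look like:

CREATE TABLE LogTimes (LogDate date NOT NULL, LogTime time(0) NOT NULL)
-- date is 3 bytes and time(0) is 3 bytes, vs. 19 characters for '2011-02-24 01:46:44' stored as varchar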

The only way I can see this REALLY saving you space is if you fully normalize the tables, something like:

CREATE TABLE Domains (DomainID int NOT NULL PRIMARY KEY, DomainName varchar(256) NOT NULL UNIQUE)
CREATE TABLE MainSite (MainSiteID int NOT NULL PRIMARY KEY, MainSiteName varchar(256) NOT NULL UNIQUE)
CREATE TABLE BrowserInfo (BrowserInfoID int NOT NULL PRIMARY KEY, BrowserInfo varchar(256) NOT NULL UNIQUE)
...etc.

You'd have to preprocess the log records and insert each distinct value of each section into its relevant table. Once that's done, you import and parse them into a staging table with the layout you described earlier. However, you'd actually STORE the logs in a table like this:

CREATE TABLE Logs(LogID int NOT NULL PRIMARY KEY, LogTime datetime NOT NULL, IPAddress varchar(15) NOT NULL,
DomainID int NOT NULL REFERENCES Domains(DomainID),
MainSiteID int NOT NULL REFERENCES MainSite(MainSiteID),
BrowserInfoID int NOT NULL REFERENCES BrowserInfo(BrowserInfoID),
...other column definitions...)
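A rough sketch of the preprocessing step for one of the lookup tables - this assumes a staging table named LogStaging and that DomainID is declared as an IDENTITY column (neither is shown above):

-- add any domain values that aren't already in the lookup table
INSERT INTO Domains (DomainName)
SELECT DISTINCT s.domain
FROM LogStaging s
WHERE NOT EXISTS (SELECT 1 FROM Domains d WHERE d.DomainName = s.domain)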

You'd join the staging table to each of the other tables to get the DomainID, MainSiteID, and so on, and only store the IDs in the Logs table. In essence, you are de-duplicating data and replacing it with small tokens instead of the full values. This will minimize the overhead and make indexing a lot more efficient and compact. Data compression is still an option and could save even more space.
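Something along these lines - again assuming a staging table named LogStaging with varchar columns for the parsed values, and LogID as an IDENTITY column:

INSERT INTO Logs (LogTime, IPAddress, DomainID, MainSiteID, BrowserInfoID)
SELECT CAST(s.logdate + ' ' + s.logtime AS datetime), s.ipaddress,
       d.DomainID, m.MainSiteID, b.BrowserInfoID
FROM LogStaging s
INNER JOIN Domains d ON d.DomainName = s.domain
INNER JOIN MainSite m ON m.MainSiteName = s.mainsite
INNER JOIN BrowserInfo b ON b.BrowserInfo = s.browserinfo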

When you need to reconstruct the log, you write a query to join the tables and include the named values instead of the IDs:

SELECT L.LogTime, L.IPAddress, D.DomainName, M.MainSiteName, B.BrowserInfo
FROM Logs L
INNER JOIN Domains D ON L.DomainID=D.DomainID
INNER JOIN MainSite M ON L.MainSiteID=M.MainSiteID
INNER JOIN BrowserInfo B ON L.BrowserInfoID=B.BrowserInfoID

The last compaction I can suggest is to convert the IP Address into a 4 byte binary value, or parse it into 4 1-byte (tinyint) columns. It depends on what kind of queries you need to do on IP Address. This one is pretty minor, you'll save a lot more space if you do the other things I described.
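For example, PARSENAME happens to split a dotted IP string nicely (just an illustration, not part of the schema above):

SELECT CAST(PARSENAME(ip, 4) AS tinyint) AS Octet1,
       CAST(PARSENAME(ip, 3) AS tinyint) AS Octet2,
       CAST(PARSENAME(ip, 2) AS tinyint) AS Octet3,
       CAST(PARSENAME(ip, 1) AS tinyint) AS Octet4
FROM (SELECT '172.16.10.10' AS ip) x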

pille
Starting Member

15 Posts

Posted - 2011-03-03 : 07:42:39
hello
thank you for your hints. i did it and the schema of the database is finished. i have filled some tables with fixed values. now i need some advice/help with a few other questions

1. how do i insert across more than one table? i read it can be done with "union", but as far as i understood that only works if the content is the same ("table a with names and table b with names").
i have 8 tables. the main table contains
unique_ID (PK)
IPAddress
domainID (FK)
DateID (FK)
....
another table is domain, which contains
DomainID (PK)
DomainName
the table domain holds values like
"1" "www.domain1.com"
"2" "www.domain2.com"
now i guess i need something like a select * from domain where "the_input" = "the_output" inside an insert query.
how can i insert across more than one table?

2. in the example above - if a new www.domain3.com needs to be added to the table domain - can i fill in the value automatically whenever a new value comes up?

3. any advice for dynamic entry of data (see the example below)?

this is what the data looks like

ip - time - date - domain - details_link - httpstatuscode
10.10.10.10 - 10:10 - 01.01.2011 - www.domain.com - /index.html/something?=312&WT.xx=123&,&WT.zz=4325&&WT.rr=rre& - 200

note: the details_link contains 2 parts: the link itself and a part with WT.value=value&. this part can appear 5 times or 20 times (always with a different wt.another_definition).
i split the wt part into a new table. the table link_details looks like
linkDetailsID (PK)
linkDetails (FK) -
wtID - which points to a table wt and its definition of the WT.xxx
is it a problem if i fill in 1 or 100 values of it? how can i solve this dynamically (i have 150 definitions - do i need to create 150 columns in the table wt?)

thank you guys for all your help and advice
normally this is a job for a database designer. speed is not the primary concern; what matters most is that all logfiles are stored in the database.
regards
pille
   
