Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 SQL Server Development (2000)
 wordlist with freq from text: optimizing

Author  Topic 

Cornelius19
Starting Member

30 Posts

Posted - 2008-09-11 : 15:48:18
Hi,

I have a relatively large text (about several Million words) from which I need to create a wordlist with frequency count. I have an SQL script with a nested loop: the outer loop goes through each record and the inner loop goes through the current record character by character and finds words (e.g. strings between two spaces). Words are written into a temp table from where frequency is calculated. The script is working as expected but is very slow: it would take an estimated 40h to do this job.

Q#1: Is there a way to improve such script?
Q#2: Is there any other approach that would be faster? For instance, a kind of isword built in function?

Cornelius

P.-S. The row count is temporarily set to 10.


if exists (select 1 from sysobjects where name='_list1')
begin drop table _list1 print '_list1 deleted' end

print'Create table where to copy the words'
CREATE TABLE [_list1] (
[words] [varchar] (500) COLLATE SQL_Latin1_General_CP1_CS_AS NULL
)

declare @sent varchar(50), @word varchar(50), @wl int, @i int, @j int, @r int

set @j = 1 -- outer loop counter: row count
set @word = '' -- a single word, between two spaces
set @r = 10
--select @r = COUNT(F1) FROM ESwik -- total # of rows in the input table
SET NOCOUNT ON

print'copy words'
while @j < @r +1
BEGIN
select @sent = F1 from ESwik where id = @j
set @wl = len(@sent) -- length of the current row
set @i = 1 -- inner loop counter: char count
while @i < 2 + @wl
BEGIN
if substring(@sent, @i,1) <> ' ' set @word = @word + substring(@sent, @i,1)
else begin insert into _list1 select @word set @word ='' end
set @i = @i + 1
END
set @j = @j + 1
END

if exists (select 1 from sysobjects where name='_list2')
begin drop table _list2 print '_list2 deleted' end

CREATE TABLE [_list2] (
[wcnt] [int] NULL ,
[words] [varchar] (500) COLLATE SQL_Latin1_General_CP1_CS_AS NULL
)

INSERT INTO _list2 (wcnt, words)
SELECT COUNT (words) AS wcnt, words
FROM _list1
GROUP BY words
ORDER BY COUNT(words) DESC

LoztInSpace
Aged Yak Warrior

940 Posts

Posted - 2008-09-11 : 20:33:32
What are you trying to do? Have you looked at full text indexes?
Also look at some of the CSV conversion code on these forums. There are examples of set based extraction that might be of use - just replace comma with space.

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=97682

not sure if it will work with SQL 2000 though.
Go to Top of Page

Cornelius19
Starting Member

30 Posts

Posted - 2008-09-17 : 11:15:40
Thanks. I was trying to break/split a very large data (about 300Meg text file) into words (after copying it to SQL Server) and count how many times each word occurs. I realize now that a perl script can do a better (i.e. faster and cleaner) job for this and will use SQL to further process the wordlist.
Go to Top of Page
   

- Advertisement -