
 Strip HTML Tags from a String

SQLNOVICE999
Yak Posting Veteran

62 Posts

Posted - 2011-10-28 : 16:20:28
Guys,

I have a table with a column that holds HTML text. The column is fairly large (datatype varchar(max)). Does anyone have a function I can use to strip out the HTML tags? I found a couple of versions online, but they ran too slowly.

This is the one I used:
http://cosier.wordpress.com/2008/10/22/tsql-strip-html-function/

Any suggestion is helpful.

Thanks,
Laura

slimt_slimt
Aged Yak Warrior

746 Posts

Posted - 2011-10-29 : 01:35:00
hi,

If you are only doing this once, then waiting should not be a problem. A function like this one is presumably slow, but keep in mind that this is no easy job. Since HTML is a well-defined language, all the tags are known in advance. You could create a lookup table that stores all the tags and use REPLACE in standard T-SQL instead of going through each word.
You might as well use CLR if you are going to do this more frequently.

best

Sachin.Nand

2937 Posts

Posted - 2011-10-29 : 03:23:11
Can't you just do this in an application program? Application languages are more flexible and have a very rich set of functions for this kind of thing. T-SQL is not optimized for something like this.
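
As a rough illustration of the application-side approach, here is a minimal Python sketch using the standard library's html.parser module. The names TextExtractor and strip_tags are made up for this example, and it assumes the column values have already been read into Python strings; it is not a drop-in replacement for the T-SQL function linked above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes; tags are simply ignored."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text with the markup removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b> &amp; friends</p>"))
```

The parser does the tokenizing, so the code never has to search for "<" and ">" itself.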



PBUH


Kristen
Test

22859 Posts

Posted - 2011-10-29 : 03:34:01
In my experience you need something that will parse the HTML. Otherwise too great a risk that you remove something like this:

100 is < 200, and 300 is > 200

which should have escaped &lt; and &gt;, but will probably display just fine in browsers, and thus may well exist in the code. There is also the issue of what you do with broken code, such as:

<SomeTag xxxx </SomeTag>

A regex-type query will be liable to remove everything between the "<" and the ">".
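
To make the risk concrete, here is a small Python sketch of the naive pattern-based approach applied to the two examples above. The pattern and the function name naive_strip are assumptions for illustration, not the code from the linked function.

```python
import re

# Naive pattern: anything from a "<" up to the next ">" is treated as a tag.
TAG_PATTERN = re.compile(r"<[^>]*>")


def naive_strip(text):
    """Remove everything that merely looks like a tag (hypothetical helper)."""
    return TAG_PATTERN.sub("", text)


if __name__ == "__main__":
    # Unescaped comparison operators: real content between "<" and ">" is lost.
    print(repr(naive_strip("100 is < 200, and 300 is > 200")))

    # Broken markup: the text "xxxx" disappears along with the tags.
    print(repr(naive_strip("before <SomeTag xxxx </SomeTag> after")))
```

In both cases the pattern cannot tell markup from content, so text that should have survived is deleted.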

Sachin.Nand

2937 Posts

Posted - 2011-10-29 : 03:45:58
quote:
Originally posted by Kristen

In my experience you need something that will parse the HTML. Otherwise too great a risk that you remove something like this:

100 is < 200, and 300 is > 200

which should have escaped &lt; and &gt;, but will probably display just fine in browsers, and thus may well exist in the code. There is also the issue of what you do with broken code, such as:

<SomeTag xxxx </SomeTag>

a reg-ex type query will be liable to remove everything between the "<" and the ">"



That's why I recommended NOT using T-SQL, because there are always lots of ifs and buts in this kind of stuff.

One example would be JavaScript or CSS embedded in the middle of the HTML, not to mention the junk tags that AJAX (if used) adds to the HTML, which of course you wouldn't want to read as part of your HTML data.
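
For what it's worth, the embedded script and CSS case can be sketched in Python with the standard html.parser module. The class and function names below are invented for this example, and it only skips script and style blocks; other junk markup would need its own rules.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes but ignores the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside script or style.
        if self.skip_depth == 0:
            self.parts.append(data)


def visible_text(html_text):
    """Return the text a reader would see (hypothetical helper)."""
    parser = VisibleTextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = "<p>Hi </p><script>var x = 1;</script><style>p {color: red}</style><p>there</p>"
    print(visible_text(sample))
```

A plain tag-stripping function would leave the JavaScript and CSS source sitting in the output as if it were text.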

PBUH
