Unmangling UTF-8 - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

Script Library

Unmangling UTF-8

Author

Topic

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-02-27 : 09:15:49

You know how it is... you thought the text in the file you bulk inserted was CP-1252. You start using it, and after a while wonder why it seems to have a few bits of garbage in it, and eventually realize that the file you loaded was actually UTF-8.

Here are two functions that might help you unmangle UTF-8 text that's been loaded into a varchar column. The second one has better error checking, but it's probably rather slow if the number of non-ASCII characters in your column is small.


SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO
/* Returns UTF-16 coded nvarchar string from a string of UTF-8 octets.
   Note that while the input is passed in as a varchar, we are actually
   treating each character as if it were a UTF-8 octet. To do this,
   the number returned by ASCII() must be the value of the octet.
   This, along with the PATINDEX() means that this function will
   probably only work if the database collation uses CP1252 for varchar
   and the collation on the value being passed in as @s uses CP1252 too.

   This function has no error checking and does not detect bogus UTF-8
   sequences.
   The function fails if the returned UTF-16 would exceed 4000 code units.
*/
CREATE FUNCTION dbo.utf8_to_utf16 (@s varchar(8000))
RETURNS nvarchar(4000)
BEGIN
  IF @s IS NULL RETURN NULL
  DECLARE @n int, @r nvarchar(4000), @cn int, @octets int, @ch nvarchar(2)
  SET @r = N''
  WHILE 1 = 1 BEGIN
    -- dubious: unexpected octets (0x80-0xBF, 0xF8-0xFF) are treated like 0x00-0x7F
    SET @n = PATINDEX('%[À-÷]%', @s COLLATE Latin1_General_bin)
    IF @n = 0 BEGIN
      SET @r = @r + @s
      BREAK
    END ELSE BEGIN
      SET @r = @r + SUBSTRING(@s, 1, @n-1)
      SET @cn = ASCII(SUBSTRING(@s, @n, 1))
      IF @cn <= 0xDF BEGIN
        SET @octets = 2
        SET @ch = NCHAR((@cn & 0x1F) * 0x40 +
          (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F))
      END ELSE IF @cn <= 0xEF BEGIN
        SET @octets = 3
        SET @ch = NCHAR((@cn & 0x0F) * 0x1000 +
          (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F) * 0x40 +
          (ASCII(SUBSTRING(@s, @n+2, 1)) & 0x3F))
      END ELSE BEGIN
        -- code point in a supplementary plane: output UTF-16 surrogate pair
        SET @octets = 4
        SET @ch = NCHAR((@cn & 0x07) * 0x100 + 
            (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F) * 0x04 +
            (ASCII(SUBSTRING(@s, @n+2, 1)) & 0x30) / 0x10 + 0xD7C0) +
          NCHAR((ASCII(SUBSTRING(@s, @n+2, 1)) & 0x0F) * 0x40 +
            (ASCII(SUBSTRING(@s, @n+3, 1)) & 0x3F) + 0xDC00)
      END
      SET @r = @r + @ch
      SET @s = SUBSTRING(@s, @n+@octets, 8000)
    END
  END
  RETURN @r
END
GO
SET QUOTED_IDENTIFIER OFF 
GO
SET ANSI_NULLS ON 
GO


SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO
/* Returns UTF-16 coded nvarchar string from a string of UTF-8 octets.
   Note that while the input is passed in as a varchar, we are actually
   treating each character as if it were a UTF-8 octet. To do this,
   the number returned by ASCII() must be the value of the octet.
   This, along with the PATINDEX() means that this function will
   probably only work if the database collation uses CP1252 for varchar
   and the collation on the value being passed in as @s uses CP1252 too.

   This function is based on utf8to32b, pp 543-545 Unicode Demystified
   by Richard Gillam (Addison-Wesley Professional, 2002).
   Most erroneous UTF-8 octet sequences convert to U+FFFD, but non-shortest
   sequences are allowed through.
   The function fails if the returned UTF-16 would exceed 4000 code units.
*/
CREATE FUNCTION dbo.utf8_to_utf16a (@s varchar(8000))
RETURNS nvarchar(4000)
BEGIN
  DECLARE @states binary(128)
  SET @states =
    -- 0 1 2 3 4 5 6 7 8 9 A B C D E F101112131415161718191A1B1C1D1E1F
    0x00000000000000000000000000000000FFFFFFFFFFFFFFFF01010101020203FF +
    0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0000000000000000FEFEFEFEFEFEFEFE +
    0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0101010101010101FEFEFEFEFEFEFEFE +
    0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0202020202020202FEFEFEFEFEFEFEFE

  DECLARE @size int, @out nvarchar(4000), @outp int, @q int, @state int, @mask int
  SELECT @size = DATALENGTH(@s), @out = 0x, @outp = 0, @q = 1, @state = 0, @mask = 0

  WHILE @q <= @size BEGIN
    DECLARE @c int
    SELECT @c = ASCII(SUBSTRING(@s, @q, 1)), @q = @q + 1,
      @state = CAST(SUBSTRING(@states, 1 + @state*32 + @c/8, 1) AS int)
    IF @state = 0 BEGIN
      SET @outp = @outp + (@c & 0x7F)
      IF @outp <= 0xFFFF
        SET @out = @out + NCHAR(@outp)
      ELSE
        SET @out = @out + NCHAR(@outp / 0x400 + 0xD7C0) + NCHAR((@outp & 0x3FF) + 0xDC00)
      SELECT @outp = 0, @mask = 0
    END ELSE IF @state <= 3 BEGIN
      IF @mask = 0 SET @mask = CAST(SUBSTRING(0x1F0F07, @state, 1) AS int)
      SELECT @outp = (@outp + (@c & @mask)) * 0x40, @mask = 0x3F
    END ELSE BEGIN
      IF @state = 0xFE SET @q = @q - 1
      SELECT @out = @out + NCHAR(0xFFFD), @outp = 0, @state = 0, @mask = 0
    END
  END
  -- end of input is incomplete sequence:
  IF @mask <> 0 SELECT @out = @out + NCHAR(0xFFFD)
  RETURN @out
END
GO
SET QUOTED_IDENTIFIER OFF 
GO
SET ANSI_NULLS ON 
GO

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2006-03-07 : 15:20:51

quote:
You know how it is... you thought the text in the file you bulk inserted was CP-1252. You start using it, and after a while wonder why it seems to have a few bits of garbage in it, and eventually realize that the file you loaded was actually UTF-8.

The forgotten proc "utf8_to_utf16" to the rescue

Amazing stuff! not surprised it came from you Arnold.

____________
Back to work

rockmoose

mateuszh
Starting Member

1 Post

Posted - 2013-01-29 : 02:47:33

hello,

I looking for sql function convert utf16 to utf8 . mybe you have it or algorithm convert? ;)

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources