You know how it is... you thought the text in the file you bulk inserted was CP-1252. You start using it, and after a while wonder why it seems to have a few bits of garbage in it, and eventually realize that the file you loaded was actually UTF-8.Here are two functions that might help you unmangle UTF-8 text that's been loaded into a varchar column. The second one has better error checking, but it's probably rather slow if the number of non-ASCII characters in your column is small.SET QUOTED_IDENTIFIER ON GOSET ANSI_NULLS ON GO/* Returns UTF-16 coded nvarchar string from a string of UTF-8 octets. Note that while the input is passed in as a varchar, we are actually treating each character as if it were a UTF-8 octet. To do this, the number returned by ASCII() must be the value of the octet. This, along with the PATINDEX() means that this function will probably only work if the database collation uses CP1252 for varchar and the collation on the value being passed in as @s uses CP1252 too. This function has no error checking and does not detect bogus UTF-8 sequences. The function fails if the returned UTF-16 would exceed 4000 code units.*/CREATE FUNCTION dbo.utf8_to_utf16 (@s varchar(8000))RETURNS nvarchar(4000)BEGIN IF @s IS NULL RETURN NULL DECLARE @n int, @r nvarchar(4000), @cn int, @octets int, @ch nvarchar(2) SET @r = N'' WHILE 1 = 1 BEGIN -- dubious: unexpected octets (0x80-0xBF, 0xF8-0xFF) are treated like 0x00-0x7F SET @n = PATINDEX('%[À-÷]%', @s COLLATE Latin1_General_bin) IF @n = 0 BEGIN SET @r = @r + @s BREAK END ELSE BEGIN SET @r = @r + SUBSTRING(@s, 1, @n-1) SET @cn = ASCII(SUBSTRING(@s, @n, 1)) IF @cn <= 0xDF BEGIN SET @octets = 2 SET @ch = NCHAR((@cn & 0x1F) * 0x40 + (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F)) END ELSE IF @cn <= 0xEF BEGIN SET @octets = 3 SET @ch = NCHAR((@cn & 0x0F) * 0x1000 + (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F) * 0x40 + (ASCII(SUBSTRING(@s, @n+2, 1)) & 0x3F)) END ELSE BEGIN -- code point in a supplementary plane: output UTF-16 surrogate pair SET @octets = 4 SET @ch = NCHAR((@cn & 0x07) * 0x100 + (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F) * 0x04 + (ASCII(SUBSTRING(@s, @n+2, 1)) & 0x30) / 0x10 + 0xD7C0) + NCHAR((ASCII(SUBSTRING(@s, @n+2, 1)) & 0x0F) * 0x40 + (ASCII(SUBSTRING(@s, @n+3, 1)) & 0x3F) + 0xDC00) END SET @r = @r + @ch SET @s = SUBSTRING(@s, @n+@octets, 8000) END END RETURN @rENDGOSET QUOTED_IDENTIFIER OFF GOSET ANSI_NULLS ON GO
SET QUOTED_IDENTIFIER ON GOSET ANSI_NULLS ON GO/* Returns UTF-16 coded nvarchar string from a string of UTF-8 octets. Note that while the input is passed in as a varchar, we are actually treating each character as if it were a UTF-8 octet. To do this, the number returned by ASCII() must be the value of the octet. This, along with the PATINDEX() means that this function will probably only work if the database collation uses CP1252 for varchar and the collation on the value being passed in as @s uses CP1252 too. This function is based on utf8to32b, pp 543-545 Unicode Demystified by Richard Gillam (Addison-Wesley Professional, 2002). Most erroneous UTF-8 octet sequences convert to U+FFFD, but non-shortest sequences are allowed through. The function fails if the returned UTF-16 would exceed 4000 code units.*/CREATE FUNCTION dbo.utf8_to_utf16a (@s varchar(8000))RETURNS nvarchar(4000)BEGIN DECLARE @states binary(128) SET @states = -- 0 1 2 3 4 5 6 7 8 9 A B C D E F101112131415161718191A1B1C1D1E1F 0x00000000000000000000000000000000FFFFFFFFFFFFFFFF01010101020203FF + 0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0000000000000000FEFEFEFEFEFEFEFE + 0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0101010101010101FEFEFEFEFEFEFEFE + 0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0202020202020202FEFEFEFEFEFEFEFE DECLARE @size int, @out nvarchar(4000), @outp int, @q int, @state int, @mask int SELECT @size = DATALENGTH(@s), @out = 0x, @outp = 0, @q = 1, @state = 0, @mask = 0 WHILE @q <= @size BEGIN DECLARE @c int SELECT @c = ASCII(SUBSTRING(@s, @q, 1)), @q = @q + 1, @state = CAST(SUBSTRING(@states, 1 + @state*32 + @c/8, 1) AS int) IF @state = 0 BEGIN SET @outp = @outp + (@c & 0x7F) IF @outp <= 0xFFFF SET @out = @out + NCHAR(@outp) ELSE SET @out = @out + NCHAR(@outp / 0x400 + 0xD7C0) + NCHAR((@outp & 0x3FF) + 0xDC00) SELECT @outp = 0, @mask = 0 END ELSE IF @state <= 3 BEGIN IF @mask = 0 SET @mask = CAST(SUBSTRING(0x1F0F07, @state, 1) AS int) SELECT @outp = (@outp + (@c & @mask)) * 0x40, @mask = 0x3F END ELSE BEGIN IF @state = 0xFE SET @q = @q - 1 SELECT @out = @out + NCHAR(0xFFFD), @outp = 0, @state = 0, @mask = 0 END END -- end of input is incomplete sequence: IF @mask <> 0 SELECT @out = @out + NCHAR(0xFFFD) RETURN @outENDGOSET QUOTED_IDENTIFIER OFF GOSET ANSI_NULLS ON GO