How to parse XML declared as UTF-8 that is stored in an NVARCHAR(MAX) column?

I am having trouble parsing an XML string stored in an NVARCHAR(MAX) column (I cannot change the type of that column).

Here is my table (WorkingHours):

CREATE TABLE WorkingHours(
    [ID] [int] NOT NULL PRIMARY KEY,
    [CONTENT] [nvarchar](MAX) NOT NULL,
    -- ...
);

      

Here's an example of a [CONTENT] value:

<?xml version="1.0" encoding="UTF-8"?>
    <calendar>
        <day number="1" worked_day="no">
            <interval number="1" begin_hour="08:30" end_hour="12:00"/>
            <interval number="2" begin_hour="13:30" end_hour="17:00"/>
            <interval number="3" begin_hour="" end_hour=""/>
        </day>
        <day number="2" worked_day="no">
            <interval number="1" begin_hour="08:30" end_hour="12:00"/>
            <interval number="2" begin_hour="13:30" end_hour="17:00"/>
            <interval number="3" begin_hour="" end_hour=""/>
        </day>
        <day number="3" worked_day="no">
            <interval number="1" begin_hour="08:30" end_hour="12:00"/>
            <interval number="2" begin_hour="13:30" end_hour="17:00"/>
            <interval number="3" begin_hour="" end_hour=""/>
        </day>
        <day number="4" worked_day="no">
            <interval number="1" begin_hour="08:30" end_hour="12:00"/>
            <interval number="2" begin_hour="13:30" end_hour="17:00"/>
            <interval number="3" begin_hour="" end_hour=""/>
        </day>
        <day number="5" worked_day="no">
            <interval number="1" begin_hour="08:30" end_hour="12:00"/>
            <interval number="2" begin_hour="13:30" end_hour="17:00"/>
            <interval number="3" begin_hour="" end_hour=""/>
        </day>
        <day number="6" worked_day="no">
            <interval number="1" begin_hour="" end_hour=""/>
            <interval number="2" begin_hour="" end_hour=""/>
            <interval number="3" begin_hour="" end_hour=""/>
        </day>
        <day number="7" worked_day="no">
            <interval number="1" begin_hour="" end_hour=""/>
            <interval number="2" begin_hour="" end_hour=""/>
            <interval number="3" begin_hour="" end_hour=""/>
        </day>
    </calendar>

      

As you can see, the XML declaration says the data is encoded as UTF-8.

Now, I would like to parse this data to create some calculations:

DECLARE @RawContent [nvarchar](MAX) = (
    SELECT wh.[CONTENT]
    FROM [WorkingHours] wh 
    WHERE wh.[ID] = 100);

DECLARE @XMLContent [Xml] = @RawContent; -- KO
-- DECLARE @XMLContent [Xml] = CAST(@RawContent AS XML);  -- KO
-- DECLARE @XMLContent [Xml] = CONVERT(XML, @RawContent); -- KO

-- Just a test to query XML data.
SELECT 
    C.WD.value('@number', 'int') AS DayId         
FROM @XMLContent.nodes('/calendar/day') AS C(WD);   

      

I don't know how to convert the result (an nvarchar(max) value containing an XML string declared as UTF-8) to an XML value. SQL Server returns the following error:

"Unable to switch encoding"

      

The error is raised on the line where I define the @XMLContent variable.

Any idea how to solve this?


2 answers


Change the XML declaration (the <?xml ...?> line): its encoding attribute is pointless and incorrect, because the data is already encoded as UTF-16 (it is stored as NVARCHAR). If you cannot change the data that already exists, you will have to rely on (slightly fragile) string replacement:

CAST(REPLACE(wh.[CONTENT], '<?xml version="1.0" encoding="UTF-8"?>', '') AS XML)
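To show the replacement in context, here is a minimal, self-contained sketch. The literal assigned to @RawContent is a shortened stand-in for a WorkingHours.[CONTENT] value, and the extra columns in the SELECT (IntervalId, BeginHour, EndHour) are illustrative additions, not part of the original query. It assumes the declaration appears exactly as written in the stored data; any variation in quoting or spacing would make the REPLACE a no-op and the cast would fail again.

```sql
-- Shortened sample value standing in for WorkingHours.[CONTENT] (assumption).
DECLARE @RawContent nvarchar(MAX) =
    N'<?xml version="1.0" encoding="UTF-8"?>
<calendar>
    <day number="1" worked_day="no">
        <interval number="1" begin_hour="08:30" end_hour="12:00"/>
        <interval number="2" begin_hour="13:30" end_hour="17:00"/>
    </day>
    <day number="2" worked_day="no">
        <interval number="1" begin_hour="" end_hour=""/>
    </day>
</calendar>';

-- Strip the misleading declaration, then convert the remaining text to XML.
DECLARE @XMLContent xml =
    CAST(REPLACE(@RawContent, N'<?xml version="1.0" encoding="UTF-8"?>', N'') AS xml);

-- One row per interval, with the number of the day it belongs to.
SELECT
    d.n.value('@number', 'int')            AS DayId,
    i.n.value('@number', 'int')            AS IntervalId,
    i.n.value('@begin_hour', 'varchar(5)') AS BeginHour,
    i.n.value('@end_hour', 'varchar(5)')   AS EndHour
FROM @XMLContent.nodes('/calendar/day') AS d(n)
CROSS APPLY d.n.nodes('interval') AS i(n);
```

The hour attributes are read as varchar(5) rather than time because some of them are empty strings in the sample data.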

      



Note that explicitly specifying encoding="UTF-16" in the declaration instead will also work, although it doesn't add anything.
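A sketch of that UTF-16 variant, again using a shortened sample value in place of the real column. It rewrites only the encoding attribute, so it shares the fragility of the approach above: it assumes the attribute is spelled exactly encoding="UTF-8", and it would also rewrite that same text if it happened to appear elsewhere in the document.

```sql
-- Shortened sample value standing in for WorkingHours.[CONTENT] (assumption).
DECLARE @RawContent nvarchar(MAX) =
    N'<?xml version="1.0" encoding="UTF-8"?><calendar><day number="1" worked_day="no"/></calendar>';

-- Make the declaration agree with the actual (UTF-16) encoding of NVARCHAR data.
DECLARE @XMLContent xml =
    CAST(REPLACE(@RawContent, N'encoding="UTF-8"', N'encoding="UTF-16"') AS xml);

SELECT C.WD.value('@number', 'int') AS DayId
FROM @XMLContent.nodes('/calendar/day') AS C(WD);
```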


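Another option is to convert first to VARCHAR, a non-Unicode data type, and then to XML. This is only safe when the content is limited to characters that survive the conversion to VARCHAR (plain ASCII, for instance); other characters can be replaced or misread: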

DECLARE @RawContent [nvarchar](MAX) = (
    SELECT wh.[CONTENT]
    FROM [WorkingHours] wh 
    WHERE wh.[ID] = 100);

DECLARE @XMLContent XML = CAST(CAST(@RawContent AS VARCHAR(MAX)) AS XML);

-- Just a test to query XML data.
SELECT 
    C.WD.value('@number', 'int') AS DayId         
FROM @XMLContent.nodes('/calendar/day') AS C(WD);   
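Because the VARCHAR route depends on which characters the data contains, it may be worth checking which rows convert cleanly before relying on it. The following is a sketch only, assuming SQL Server 2012 or later (for TRY_CONVERT) and the WorkingHours table from the question. It lists the rows where the double conversion fails; it does not detect characters that were silently replaced during the conversion to VARCHAR.

```sql
-- Rows whose [CONTENT] cannot be converted to XML via VARCHAR.
-- [CONTENT] is declared NOT NULL, so a NULL result here means the conversion failed.
SELECT wh.[ID]
FROM [WorkingHours] wh
WHERE TRY_CONVERT(xml, CAST(wh.[CONTENT] AS varchar(MAX))) IS NULL;
```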

      
