Characters such as ZÖE, DÉCOR, CIARÁN cannot be read from my XML

Question

I have a large XML document that contains accented characters such as ZÖE, DÉCOR and CIARÁN. I am using Java with MarkLogic as my database. I cannot read the document when it contains these words; when I remove them, reading works fine.

My Java code:

    DatabaseClient client = DatabaseClientFactory.newClient(IP, PORT,
        DATABASE_NAME, USERNAME, PWD, Authentication.DIGEST);

    XMLDocumentManager docMgr = client.newXMLDocumentManager();
    DOMHandle xmlhandle = new DOMHandle();
    docMgr.read("/" + filename, xmlhandle);

Modified question: I said above that I couldn't read special characters. How can I insert special characters so that they come back in the same form when I read the document?

Example: When I insert characters like CIARÁN AURÉLIE BARGÈME, they are saved, but when I read the document the data comes back as CIARAN AURELIE BARGEME, not as inserted.

    DatabaseClient client = DatabaseClientFactory.newClient(IP, PORT,
        DATABASE_NAME, USERNAME, PWD, Authentication.DIGEST);

    XMLDocumentManager docMgr = client.newXMLDocumentManager();
    DOMHandle xmlhandle = new DOMHandle();
    docMgr.read("/" + filename, xmlhandle);
    String doc = xmlhandle.toString();
    // Decompose to NFD, then delete every non-ASCII character
    String data = Normalizer.normalize(doc, Normalizer.Form.NFD)
                    .replaceAll("[^\\p{ASCII}]", "");

I am using Normalizer to read the special characters; without them, the normal XML descriptor works fine.

java xml marklogic

Abhilash reddy May 07 '15 at 11:06

1 answer

Sunil kumar · Accepted Answer · 2015-05-07T11:15:30+0000

According to the official documentation:

If you specify an encoding and it turns out to be the wrong encoding, then the conversion will likely not turn out as you expect.

MarkLogic Server stores text, XML, and JSON as UTF-8. In Java, characters in memory and in reading streams are UTF-16. The Java API converts characters to and from UTF-8 automatically.

When writing documents to the server, you need to know whether they are already UTF-8 encoded. If a document is not UTF-8, you must specify its encoding, or you are likely to end up with data that has incorrect characters. If you specify an encoding other than UTF-8, the Java API will automatically convert it to UTF-8 when writing to MarkLogic.
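The effect of declaring the wrong encoding can be seen without MarkLogic at all. The following is a minimal sketch using only the JDK; the class name `EncodingRoundTrip`, the helper `decode`, and the choice of ISO-8859-1 as the source encoding are assumptions made for illustration and are not part of the MarkLogic Java API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when bytes are decoded
// with the charset they were written in versus a different one.
final class EncodingRoundTrip {

    // Decode raw bytes using the charset the caller declares for them.
    static String decode(byte[] raw, Charset declared) {
        return new String(raw, declared);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "CIAR\u00C1N AUR\u00C9LIE BARG\u00C8ME";

        // Assume the document on disk was saved as ISO-8859-1.
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        String declaredCorrectly = decode(latin1Bytes, StandardCharsets.ISO_8859_1);
        String declaredWrongly = decode(latin1Bytes, StandardCharsets.UTF_8);

        System.out.println(declaredCorrectly.equals(original)); // true
        System.out.println(declaredWrongly.equals(original));   // false
    }
}
```

In the second case the accented bytes are not valid UTF-8, so Java substitutes replacement characters. This is the kind of damage the documentation warns about when a non-UTF-8 document is written without its encoding being declared.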

https://docs.marklogic.com/guide/java/document-operations#id_11208
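Separately from encoding, it may be worth checking the `Normalizer` call in the question's second snippet. NFD decomposition followed by removal of every non-ASCII character strips diacritics on its own, whatever the database returns. The sketch below reproduces that with plain JDK calls; the class `AccentStripping` and its method names are hypothetical and exist only for this illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class: compares the question's NFD-and-strip step
// with NFC normalization, which keeps accented letters intact.
final class AccentStripping {

    // Same two steps as the question's snippet: decompose each accented
    // letter into base letter + combining mark, then drop all non-ASCII.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    // NFC composes base letter + combining mark into one code point,
    // so nothing is lost.
    static String keepAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String inserted = "CIAR\u00C1N AUR\u00C9LIE BARG\u00C8ME";

        System.out.println(stripAccents(inserted));                 // CIARAN AURELIE BARGEME
        System.out.println(keepAccents(inserted).equals(inserted)); // true
    }
}
```

If the goal is to get the text back exactly as inserted, dropping the `replaceAll` step (or the normalization altogether) would avoid this loss, assuming the document was stored with the correct encoding in the first place.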
