Apache POI Abnormal spaces (Solved: \u00A0 non-breaking space)

Question


EDIT: Solved. The character was a non-breaking space, U+00A0, which UTF-8 encodes as the bytes c2 a0.

After using Apache POI to convert from docx to plaintext and then reading plaintext in Java and trying to parse it, I ran into the following problems.

Output:

" "
first characterequals SPACE OR TAB 
false
[B@5e481248
[B@66d3c617
ARRAYTOSTRING SPACE: [32]
ARRAYTOSTRING ?????: [-62, -96]

For the code:

System.out.println("\t\"" + line.substring(0,1) + "\"\n\tfirst characterequals SPACE OR TAB \n\t" + (line.substring(0,1).equals(" ") 
                        || line.substring(0,1).equals("\t") ));
System.out.println(line.substring(0,1).getBytes());
System.out.println(" ".getBytes());
System.out.println("ARRAYTOSTRING SPACE: " + Arrays.toString(" ".getBytes()));
System.out.println("ARRAYTOSTRING ?????: " + Arrays.toString(line.substring(0,1).getBytes()));

String.trim() does not get rid of it
String.replaceAll("\\s", "") does not get rid of it

I’m trying to parse a large volume of documents, and this is turning into a major obstacle. I have no idea what this character is or how to handle it. Can anyone shed some light on what is going on here?


java apache-poi

Captain prinny 03 June '15 at 21:00


1 answer

llogiq · Accepted Answer · 2015-06-03T21:15:27+0000

The bytes [-62, -96] translate to the hexadecimal bytes c2 a0, which according to this answer is the UTF-8 encoding of a non-breaking space (U+00A0). Note that this is not an ordinary space, and \s will not match it.
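For anyone hitting the same thing, here is a minimal sketch of how the character could be identified and cleaned up. The class name NbspCleanup and its helper methods are made up for illustration and are not part of Apache POI. It assumes the text has already been read into a Java String, and that replacing the non-breaking space with an ordinary space is acceptable for your parsing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class NbspCleanup {
    // U+00A0 NO-BREAK SPACE; UTF-8 encodes it as the two bytes 0xC2 0xA0.
    static final char NBSP = '\u00A0';

    // Render bytes as unsigned hex, so the signed values -62 and -96 appear as "c2 a0".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Replace every non-breaking space with an ordinary space, then trim as usual.
    static String normalize(String line) {
        return line.replace(NBSP, ' ').trim();
    }

    // Remove all whitespace, naming U+00A0 explicitly because the default \s skips it.
    static String stripAllWhitespace(String line) {
        return line.replaceAll("[\\s\\u00A0]", "");
    }

    public static void main(String[] args) {
        String line = "\u00A0Heading text";
        // Show the UTF-8 bytes of the first character in hex.
        System.out.println(toHex(line.substring(0, 1).getBytes(StandardCharsets.UTF_8))); // c2 a0
        // isWhitespace deliberately excludes non-breaking spaces; isSpaceChar includes them.
        System.out.println(Character.isWhitespace(line.charAt(0))); // false
        System.out.println(Character.isSpaceChar(line.charAt(0)));  // true
        System.out.println("[" + normalize(line) + "]");            // [Heading text]
    }
}
```

Passing an explicit charset to getBytes also avoids depending on the platform default encoding, and printing the hex form is easier to look up than the signed values from Arrays.toString.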
