Apache POI Abnormal spaces (Solved: \u00A0 non-breaking space)
EDIT: Solved. The character was the non-breaking space 00a0 (\u00A0), not c0a0.
After using Apache POI to convert from docx to plaintext and then reading plaintext in Java and trying to parse it, I ran into the following problems.
Output:
" "
first characterequals SPACE OR TAB
false
[B@5e481248
[B@66d3c617
ARRAYTOSTRING SPACE: [32]
ARRAYTOSTRING ?????: [-62, -96]
For the code:
System.out.println("\t\"" + line.substring(0,1) + "\"\n\tfirst characterequals SPACE OR TAB \n\t" + (line.substring(0,1).equals(" ")
|| line.substring(0,1).equals("\t") ));
System.out.println(line.substring(0,1).getBytes());
System.out.println(" ".getBytes());
System.out.println("ARRAYTOSTRING SPACE: " + Arrays.toString(" ".getBytes()));
System.out.println("ARRAYTOSTRING ?????: " + Arrays.toString(line.substring(0,1).getBytes()));
String.trim() does not get rid of it
String.replaceAll("\\s", "") does not get rid of it
I'm trying to parse a huge amount of document material and this is turning into a major obstacle. I have no idea what is going on or how to deal with it. Can anyone shed some light on what is happening here?
The bytes [-62, -96] translate to hexadecimal c2 a0, which according to this answer is a UTF-8 encoded non-breaking space. Note that this is not an ordinary space, so \s will not match it.
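One way to deal with it is sketched below, assuming you want the non-breaking space treated like ordinary whitespace. The class name NbspCleanup and the helpers normalizeSpaces and stripAllSpaces are made up for illustration; only the standard String and regex calls come from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NbspCleanup {
    // U+00A0 NO-BREAK SPACE: trim() and the default \s both ignore it.
    static final char NBSP = '\u00A0';

    /** Replaces every non-breaking space with an ordinary space, then trims. */
    static String normalizeSpaces(String line) {
        return line.replace(NBSP, ' ').trim();
    }

    /** Removes ordinary whitespace and non-breaking spaces alike. */
    static String stripAllSpaces(String line) {
        // Add the code point to the character class explicitly.
        return line.replaceAll("[\\s\\u00A0]", "");
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for a line extracted from the docx.
        String line = "\u00A0Heading text";

        // Pass the charset explicitly so the bytes do not depend on the platform default.
        byte[] first = line.substring(0, 1).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(first));      // [-62, -96], i.e. c2 a0

        System.out.println(line.trim().equals(line));    // true: trim() left it alone
        System.out.println(normalizeSpaces(line));       // Heading text
        System.out.println(stripAllSpaces(line));        // Headingtext
    }
}
```

Replacing the character first and trimming afterwards keeps the spacing between words intact. The regex variant is closer to the original replaceAll attempt and strips every space.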