Detecting encoding with Java

Question

I have a working example (see below) that detects the encoding of a file using the UniversalDetector class from Mozilla.

But I want it to detect the encoding of the input, for example input read with the Scanner class, rather than the encoding of a file. How do I change the code below to detect the encoding of the input instead of the file?

import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws IOException {
    byte[] buf = new byte[4096];

    // (1) Create the detector (no listener).
    UniversalDetector detector = new UniversalDetector(null);

    // (2) Feed the file's bytes to the detector until it is done or the file ends.
    try (FileInputStream fis = new FileInputStream("C:\\Users\\khalat\\Desktop\\Java\\toti.txt")) {
      int nread;
      while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
      }
    }

    // (3) Tell the detector that no more data will follow.
    detector.dataEnd();

    // (4) Read the result; null means no encoding could be detected.
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5) Reset the detector so it can be reused.
    detector.reset();
  }
}

Marcus Apr 21 15 at 15:29

2 answers

Marcus · Answer 1 · 2015-04-22T12:11:27+0000

I found an elegant example that can at least test whether a character is ISO-8859-1; see the code below.

import java.nio.charset.StandardCharsets;

public class TestIso88591 {
    public static void main(String[] args) {
        if (TestIso88591.testISO("ü")) {
            System.out.println("True");
        } else {
            System.out.println("False");
        }
    }

    // Returns true if every character of text can be encoded in ISO-8859-1.
    public static boolean testISO(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }
}

Now I have a question for the Java experts: is there any way to check whether a character is ISO-8859-5 or ISO-8859-7? Yes, I know there is UTF-8, but my exact question is how to check for ISO-8859-5 characters, because the input data needs to be stored in SAP, and SAP can only handle ISO-8859-1 characters. I need this as soon as possible.
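The same canEncode check can be pointed at other charsets. What follows is only a minimal sketch, assuming the JVM in use ships the ISO-8859-5 and ISO-8859-7 charsets; the class name CharsetFitCheck and the method fits are made up for illustration.

```java
import java.nio.charset.Charset;

class CharsetFitCheck {
    /** Returns true if every character of text can be encoded in the named charset. */
    static boolean fits(String text, String charsetName) {
        // Charset.forName throws UnsupportedCharsetException if the JVM lacks the charset.
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String cyrillic = "\u0416"; // Cyrillic capital letter ZHE
        String greek = "\u03b1";    // Greek small letter alpha

        System.out.println(fits(cyrillic, "ISO-8859-5")); // true
        System.out.println(fits(cyrillic, "ISO-8859-1")); // false
        System.out.println(fits(greek, "ISO-8859-7"));    // true
    }
}
```

Note that this answers "can this text be stored in that charset", not "which charset were these bytes written in"; the second question is what a detector tries to guess.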

Rene M. · Answer 2 · 2015-04-21T15:40:43+0000

OK, I researched a little more, and the result: it is useless to read bytes from stdin to guess the encoding, because the Java API lets you read the input directly as an already decoded string ;) The only use for this detector is when you get a stream of unknown bytes from a file, a socket, etc. and have to guess how to decode it into a Java string.
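To illustrate reading input as an already decoded string, here is a small sketch with Scanner. It assumes the charset of the input is known in advance (UTF-8 here), and it uses a ByteArrayInputStream as a stand-in for System.in so it can run unattended; the class name ScannerDecodeDemo and the method firstLine are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class ScannerDecodeDemo {
    /** Reads the first line from the stream, decoding its bytes with the named charset. */
    static String firstLine(InputStream in, String charsetName) {
        // The Scanner is deliberately not closed: closing it would also close
        // the underlying stream, which is unwanted when that stream is System.in.
        Scanner scanner = new Scanner(in, charsetName);
        return scanner.hasNextLine() ? scanner.nextLine() : "";
    }

    public static void main(String[] args) {
        // Stand-in for System.in: one UTF-8 encoded line containing non-ASCII letters.
        byte[] bytes = "gr\u00fc\u00df\n".getBytes(StandardCharsets.UTF_8);
        String line = firstLine(new ByteArrayInputStream(bytes), "UTF-8");
        System.out.println(line.length()); // 4 characters, although the line is 6 bytes
    }
}
```

With real console input, firstLine(System.in, "UTF-8") would be the equivalent call; the charset name still has to match what the console actually sends.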

The following pseudocode is just a theoretical approach. But as we found out, it doesn't make sense ;)

It's very simple.

byte[] buf = new byte[4096];
java.io.FileInputStream fis = new java.io.FileInputStream("C:\\Users\\khalat\\Desktop\\Java\\toti.txt");

UniversalDetector detector = new UniversalDetector(null);

int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}

What you are doing here is reading from a file into a byte array, which is then passed to the detector.

Replace your FileInputStream with a different input stream.

For example, to read everything from standard input:

byte[] buf = new byte[4096];
// Read raw bytes: a Reader would already decode the bytes into chars.
java.io.InputStream in = System.in;

UniversalDetector detector = new UniversalDetector(null);

int nread;
while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}

ATTENTION!! This code has not been tested by me. It is based only on the Java API docs. I would also place a BufferedInputStream between the input stream and the read into the buffer. The 4096-byte buffer size is not a problem: read returns as soon as some bytes are available, so the loop also runs when fewer than 4096 bytes arrive on standard input, and it ends at end of input or once the detector is done.
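The behavior of the read loop with short input can be checked without the detector library. The sketch below is only an illustration: it substitutes a ByteArrayInputStream for System.in, and the class name ReadLoopDemo and the method drain are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadLoopDemo {
    /** Reads the stream in chunks and returns the total number of bytes seen. */
    static int drain(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        int total = 0;
        int nread;
        // read(buf) returns the number of bytes actually read, which may be
        // fewer than buf.length, and -1 at end of stream.
        while ((nread = in.read(buf)) > 0) {
            // A call such as detector.handleData(buf, 0, nread) would go here.
            total += nread;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Three bytes only, far fewer than the 4096-byte buffer.
        InputStream shortInput = new ByteArrayInputStream(new byte[] {104, 105, 10});
        System.out.println(drain(shortInput)); // prints 3
    }
}
```

The loop body runs once for the three-byte input, so a buffer larger than the input does not stop the data from reaching the detector.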

About the Reader API: the base class java.io.Reader ( http://docs.oracle.com/javase/7/docs/api/java/io/Reader.html#read(char[],%20int,%20int) ) declares the read method as abstract, and every concrete reader has to implement it. Note that this method reads chars, not bytes. THAT'S IT!!!

About the point that determining the encoding of a chunk of unknown bytes is impossible: yes, that's right. But you can make a guess, as the Mozilla detector does, because you have some clues: 1. we expect the bytes to be text; 2. we know which byte sequences are valid in any given encoding; 3. we can try to decode the bytes in an assumed encoding and check the resulting string.
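Clue 3 can be sketched with the JDK's own decoders: decode strictly in a candidate charset and treat a decoding error as "not this encoding". This is only an illustration of the idea, not how the Mozilla detector works internally; the class name StrictDecodeCheck and the method decodesCleanly are made up here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeCheck {
    /** Returns true if data decodes in charset without malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xFC}; // u with diaeresis in ISO-8859-1

        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));      // false
        System.out.println(decodesCleanly(latin1, StandardCharsets.ISO_8859_1)); // true
    }
}
```

A clean decode only rules candidates in or out; it proves nothing on its own. ISO-8859-1 assigns a character to every byte value, so any input decodes cleanly in it, which is why a real detector also relies on statistics about the decoded text.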

About us, the experts: yes, most of us are ;) But we don't like doing someone else's homework. We like to fix bugs or give advice. So give a complete example that produces an error we can fix. Or, as happened here, we give you advice with some pseudocode. (I don't have time to set up a project and write a working example.)

Nice stream of comments;)
