Converting from windows-1256 to UTF-8 causes punctuation issue

I have an Arabic subtitle file that I am trying to convert from SRT to VTT. The file appears to use windows-1256, according to the character encoding detector in ICU (Java). The final VTT file is in UTF-8.

The subtitle converts fine and everything looks correct, except that the punctuation moves from the left side to the right side. I am using this subtitle on Chromecast, so at first I thought it was a Chromecast issue, but even gedit on Linux shows the problem. LibreOffice does not, however, and neither does the IntelliJ console output.

I wrote a simple piece of code to recreate the problem without actually converting from SRT to VTT by simply converting from windows-1256 to UTF-8.

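import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Recode {
    public static void main(String[] args) throws IOException {
        // Decode the subtitle as windows-1256, echo each line, and write it back as UTF-8.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                 new FileInputStream("arabic sub.srt"), Charset.forName("windows-1256")));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream("bad punctuation.srt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
                writer.write(line);
                writer.write("\r\n");
            }
        }

        // Read the UTF-8 copy back and print it again.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                 new FileInputStream("bad punctuation.srt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}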

      

Here is the output from the IntelliJ console:

[Screenshot: IntelliJ console]

As you can see, the period is on the left side, which I think is correct.

Here's what gedit shows:

[Screenshot: gedit]

The text being on the right is correct, I think, but the period is also on the right, which I think is incorrect.

Here is LibreOffice:

[Screenshot: LibreOffice]

This is mostly correct: the punctuation is on the left. However, the text is also on the left, and I think it should be on the right.

These are the subtitles I'm testing https://www.opensubtitles.org/en/subtitles/5168225/game-of-thrones-fire-and-blood-ar

I also tried another SRT that was originally encoded as UTF-8, and it worked fine without any issue. So my guess is that the problem comes from the conversion from windows-1256.

So what is wrong with how I re-encode the file?

Thanks.

Edit: Forgot the Chrome screenshot.

[Screenshot: Chrome]

As you can see, the punctuation is on the wrong side.

EDIT: I just noticed that chardet on Linux says the file is MacCyrillic, not windows-1256, but the Java ICU library says windows-1256. Anyway, if I use MacCyrillic, the punctuation looks right in gedit, but the text itself turns into garbage characters.



2 answers


Looking at the original subtitle file, I can say for sure that it is poorly formatted. The periods appear before the text, even when the file is displayed with a left-to-right character set. I believe the correct character set is windows-1256.

The only way this will display correctly is if the punctuation at the beginning of the line is displayed as LTR and the rest of the line is displayed as RTL. You can try to force this by adding the Unicode left-to-right mark (U+200E) right after the punctuation.
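As a rough illustration of that idea, here is a minimal Java sketch. The class and method names are hypothetical, and it assumes that "punctuation" means any Unicode punctuation character (`\p{P}`) and that only lines containing Arabic script should be touched. Note that a leading dialogue dash would be treated the same way, which may or may not be what you want.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeadingPunctuationMark {
    // U+200E LEFT-TO-RIGHT MARK: invisible, but counts as a strong LTR character.
    static final char LRM = '\u200E';

    // A run of Unicode punctuation at the very start of a line.
    private static final Pattern LEADING_PUNCT = Pattern.compile("^\\p{P}+");
    // Any character in the Arabic script.
    private static final Pattern ARABIC = Pattern.compile("\\p{IsArabic}");

    // Hypothetical helper: insert LRM right after the leading punctuation,
    // but only on lines that actually contain Arabic text.
    static String markLeadingPunctuation(String line) {
        Matcher m = LEADING_PUNCT.matcher(line);
        if (m.find() && ARABIC.matcher(line).find()) {
            return line.substring(0, m.end()) + LRM + line.substring(m.end());
        }
        return line;
    }

    public static void main(String[] args) {
        String fixed = markLeadingPunctuation(".\u0645\u0631\u062D\u0628\u0627");
        // Print the code points, since LRM itself is invisible.
        fixed.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

In the question's loop, this would be applied to each `line` before `writer.write(line)`. Whether a given player honors the mark is something to verify on the target device.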



If you prefer to fix the original file, you will need to move any punctuation marks from the beginning of each line to the end. Parentheses at the beginning of a line would also need to be reversed.
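A sketch of that clean-up in Java might look like the following. The names are hypothetical, and it rests on assumptions about the file that should be checked against real lines: that the misplaced punctuation is a contiguous run at the start of the line, that its order should be reversed when moved, and that only lines containing Arabic script need fixing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeadingPunctuationMover {
    // A leading run of punctuation, followed by the rest of the line.
    private static final Pattern LINE = Pattern.compile("^(\\p{P}+)(.*)$", Pattern.DOTALL);
    private static final Pattern ARABIC = Pattern.compile("\\p{IsArabic}");

    // Swap a bracket for its counterpart; leave every other character alone.
    static char mirror(char c) {
        switch (c) {
            case '(': return ')';
            case ')': return '(';
            case '[': return ']';
            case ']': return '[';
            case '{': return '}';
            case '}': return '{';
            default:  return c;
        }
    }

    // Hypothetical helper: move leading punctuation to the end of the line,
    // reversing its order and mirroring any brackets on the way.
    static String moveLeadingPunctuation(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches() || !ARABIC.matcher(m.group(2)).find()) {
            return line; // no leading punctuation, or not an Arabic line
        }
        String lead = m.group(1);
        StringBuilder tail = new StringBuilder();
        for (int i = lead.length() - 1; i >= 0; i--) {
            tail.append(mirror(lead.charAt(i)));
        }
        return m.group(2) + tail;
    }
}
```

For example, a line stored as a period followed by Arabic text comes back as the Arabic text followed by the period. Leading dialogue dashes would be moved too, so they may need to be excluded.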



Since the encoding has nothing to do with text direction (LTR and RTL), I think you should use the Unicode marks created specifically for this purpose:

  • left-to-right mark: &lrm; or &#x200E; (U+200E)
  • right-to-left mark: &rlm; or &#x200F; (U+200F)


In short: a text file has no text orientation information, it is just a text file.

Cf. https://www.w3.org/TR/WCAG-TECHS/H34.html
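To see how such a mark changes the direction a viewer infers, here is a small sketch using `java.text.Bidi` from the JDK. It assumes the viewer picks the paragraph direction from the first strong character, as the Unicode Bidirectional Algorithm does by default; actual editors and players may apply their own rules. The class and method names are hypothetical.

```java
import java.text.Bidi;

final class BaseDirection {
    // True if the paragraph's base direction resolves to left-to-right when
    // it is taken from the first strong character (falling back to LTR).
    static boolean baseIsLtr(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).baseIsLeftToRight();
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627";

        // The period is neutral, so the Arabic letters decide: base is RTL.
        System.out.println(baseIsLtr("." + arabic));

        // U+200E after the period is the first strong character: base is LTR.
        System.out.println(baseIsLtr(".\u200E" + arabic));

        // U+200F in front of Latin text makes the base RTL.
        System.out.println(baseIsLtr("\u200F" + "abc"));
    }
}
```

This is consistent with the symptoms in the question: a viewer that assumes an LTR base keeps the leading period on the left, while one that detects an RTL base from the Arabic letters places it on the right.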
