How to remove BOM from UTF-8 file?

I have a UTF-8 encoded file with a BOM and would like to remove the BOM. Are there Linux command-line tools to remove a BOM from a file?

$ file test.xml
test.xml:  XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines

      



4 answers


The BOM is the Unicode code point U+FEFF; its UTF-8 encoding consists of the three bytes 0xEF, 0xBB, 0xBF.
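Before removing anything, it can help to confirm that the file really starts with those three bytes. The following is a minimal sketch using only od and tr; has_bom and the demo file name are hypothetical names chosen for illustration.

```shell
# has_bom FILE: succeed (exit status 0) if FILE starts with the bytes EF BB BF.
# (has_bom is a hypothetical helper name, not a standard tool.)
has_bom() {
    # od -An drops the offset column, -tx1 prints each byte as two hex digits,
    # -N3 reads only the first three bytes; tr removes spaces and newlines.
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]
}

# Demo on a throwaway file (the name is arbitrary).
printf '\357\273\277<root/>\n' > bom_demo.xml
if has_bom bom_demo.xml; then
    echo "BOM present"
else
    echo "no BOM"
fi
```

The check reads only the first three bytes, so it stays fast on large files.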

With bash, you can create a UTF-8 BOM using the special quoting form $'', which implements Unicode escapes: $'\uFEFF'. So, for bash, a reliable way to remove the UTF-8 BOM from the beginning of a text file is:

sed -i $'1s/^\uFEFF//' file.txt

      

This leaves the file unchanged if it does not start with a UTF-8 BOM, and removes the BOM if it does.



If you are using some other shell, you may find that "$(printf '\ufeff')" produces a BOM character (this works with zsh, as well as with any shell that lacks a builtin printf, provided that /usr/bin/printf is the GNU version). If you want a POSIX-compatible version, though, you can use:

sed "$(printf '1s/^\357\273\277//')" file.txt

      

(The edit-in-place flag -i is also a GNU extension; this version writes the possibly modified file to standard output.)
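Because the POSIX-compatible command writes to standard output, replacing the original file takes one more step. Here is a minimal sketch of one way to do that through a temporary file. It assumes mktemp is available (common, but not required by POSIX), and strip_bom is a hypothetical helper name.

```shell
# strip_bom FILE: remove a leading UTF-8 BOM from FILE in place.
# (strip_bom is a hypothetical helper name, not a standard tool.)
strip_bom() {
    # Assumption: mktemp exists; it is common but not required by POSIX.
    tmp=$(mktemp) || return 1
    # LC_ALL=C makes sed treat the input as raw bytes.
    if LC_ALL=C sed "$(printf '1s/^\357\273\277//')" "$1" > "$tmp"; then
        # Copy the contents back instead of renaming the temporary file,
        # so the original file keeps its permissions and ownership.
        cat "$tmp" > "$1"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Demo on a throwaway file (the name is arbitrary).
printf '\357\273\277hello\n' > strip_demo.txt
strip_bom strip_demo.txt
od -An -c strip_demo.txt
```

Running strip_bom a second time on the same file should change nothing, since the sed expression only matches a BOM at the start of the first line.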



Using VIM



  1. Open the file in VIM:

    vi text.xml

  2. Remove the BOM:

    :set nobomb

  3. Save and exit:

    :wq
    
          



You can remove the BOM from a file with the tail command:

tail --bytes=+4 withBOM.txt > withoutBOM.txt
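Note that tail drops the first three bytes whether or not they are a BOM, so running it on a file without one would eat real content. A guarded variant might look like the sketch below. It uses the POSIX spelling tail -c +4 rather than the GNU long option, and cat_without_bom is a hypothetical helper name.

```shell
# cat_without_bom FILE: print FILE to standard output, skipping a leading
# UTF-8 BOM if (and only if) one is present.
# (cat_without_bom is a hypothetical helper name, not a standard tool.)
cat_without_bom() {
    # Read the first three bytes as hex and compare them with EF BB BF.
    if [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]; then
        # Start output at byte 4, i.e. skip the three BOM bytes.
        tail -c +4 "$1"
    else
        cat "$1"
    fi
}

# Demo on throwaway files (the names are arbitrary).
printf '\357\273\277abc\n' > with_bom_demo.txt
printf 'abc\n' > without_bom_demo.txt
cat_without_bom with_bom_demo.txt > cleaned_demo.txt
cmp cleaned_demo.txt without_bom_demo.txt && echo "identical"
```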

      



OK, I just figured this out today, and my preferred approach was dos2unix.

dos2unix will remove the BOM and also take care of other operating systems' idiosyncrasies, such as line endings:

$ sudo apt install dos2unix
$ dos2unix test.xml

      

BOM removal can also be requested explicitly (-r, --remove-bom):

$ dos2unix -r test.xml

      

Note: tested with dos2unix 7.3.4







