How to remove BOM from UTF-8 file?
The BOM is the Unicode code point U+FEFF; its UTF-8 encoding consists of the three hex values 0xEF, 0xBB, 0xBF.
With bash, you can create a UTF-8 BOM using the special quoting form $'', which implements Unicode escapes: $'\uFEFF'. So with bash, a reliable way to remove the UTF-8 BOM from the beginning of a text file is:
sed -i $'1s/^\uFEFF//' file.txt
This leaves the file unchanged if it does not start with a UTF-8 BOM, and removes the BOM otherwise.
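To check whether a file actually starts with a BOM, before or after running the command, one option is to look at its first three bytes. The following is only a minimal sketch, assuming a POSIX od and tr; the function name has_bom is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper (name invented for this example):
# succeeds if the file "$1" starts with the three bytes EF BB BF.
has_bom() {
    # od prints the first 3 bytes as hex, e.g. " ef bb bf";
    # tr strips the spaces and the newline so the result is easy to compare.
    [ "$(od -An -tx1 -N3 "$1" | tr -d ' \n')" = "efbbbf" ]
}
```

It could be used as: has_bom file.txt && echo "file.txt starts with a UTF-8 BOM"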
If you are using some other shell, you may find that "$(printf '\ufeff')" produces the BOM character (that works with zsh, as well as with any shell that has no builtin printf, provided that /usr/bin/printf is the GNU version). If you want a POSIX-compatible version, though, you can use:
sed "$(printf '1s/^\357\273\277//')" file.txt
(The edit-in-place flag -i is also a GNU extension; this version writes the possibly modified file to standard output.)
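Because the POSIX-compatible command writes to standard output, getting the effect of an in-place edit takes a temporary file. Here is a rough sketch of one way to do that; the function name strip_bom is invented for this example, and mktemp is assumed to be available (it is common but not required by POSIX).

```shell
# Hypothetical helper (name invented for this example):
# removes a leading UTF-8 BOM from the file "$1" without using sed -i.
strip_bom() {
    file=$1
    tmp=$(mktemp) || return 1   # assumption: mktemp exists on this system
    # printf turns the octal escapes into the three BOM bytes EF BB BF,
    # so sed sees a literal BOM in its pattern.
    if sed "$(printf '1s/^\357\273\277//')" "$file" > "$tmp"; then
        # Write back through the existing file so its permissions are kept.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Called as strip_bom file.txt, it should leave a file without a BOM as it was.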
OK, I just dealt with this today, and my preferred way was dos2unix.
dos2unix will remove the BOM and also take care of other quirks of files coming from other operating systems:
$ sudo apt install dos2unix
$ dos2unix test.xml
It is also possible to remove only the BOM (-r, --remove-bom):
$ dos2unix -r test.xml
Note: tested with dos2unix 7.3.4