Bash - Remove all Unicode spaces and replace with normal space
I have a file with a lot of text that contains special space characters (Unicode spaces).
I need to replace all of them with the regular space character.
Simple use of perl:
perl -CSDA -plE 's/\s/ /g' file
but, as @mklement0 points out in the comments, it will also match \t
(TAB). If this is a problem, you can use
perl -CSDA -plE 's/[^\S\t]/ /g'
Demo:
X             　X
The line above contains:
U+00058 X LATIN CAPITAL LETTER X
U+01680   OGHAM SPACE MARK
U+02002   EN SPACE
U+02003   EM SPACE
U+02004   THREE-PER-EM SPACE
U+02005   FOUR-PER-EM SPACE
U+02006   SIX-PER-EM SPACE
U+02007   FIGURE SPACE
U+02008   PUNCTUATION SPACE
U+02009   THIN SPACE
U+0200A   HAIR SPACE
U+0202F   NARROW NO-BREAK SPACE
U+0205F   MEDIUM MATHEMATICAL SPACE
U+03000 　 IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X
Running it through:
perl -CSDA -plE 's/\s/_/g' <<<"X             　X"
(note that the demo replaces with an underscore so the result is visible) prints
X_____________X
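For comparison, here is a minimal Python sketch of the same double-negative idea as the Perl one-liner above. It assumes Python 3, where `\s` in a `str` pattern matches Unicode whitespace. The function name `replace_unicode_spaces` is made up for illustration. The character class `[^\S\t\n]` reads as "whitespace, but not TAB or newline"; other ASCII whitespace such as `\r`, `\f` and `\v` is still replaced.

```python
import re

# [^\S\t\n] is a double negative: "not non-whitespace, and not TAB or newline",
# i.e. every whitespace character except TAB and newline.
_SPACE_NOT_TAB = re.compile(r"[^\S\t\n]")


def replace_unicode_spaces(text, replacement=" "):
    """Replace each whitespace character except TAB and newline."""
    return _SPACE_NOT_TAB.sub(replacement, text)


# The demo string, written with escapes so the code points stay visible.
demo = (
    "X\u1680\u2002\u2003\u2004\u2005\u2006\u2007"
    "\u2008\u2009\u200a\u202f\u205f\u3000X"
)

print(replace_unicode_spaces(demo, "_"))
print(repr(replace_unicode_spaces("a\tb\u00a0c")))
```

The first call uses an underscore only to make the result visible; the second shows that a TAB is left alone while a no-break space becomes a regular space.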
It can also be done with pure bash:
LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal
while IFS= read -r line; do
    printf '%s\n' "${line//[$spaces]/ }"
done < file
The LC_ALL=en_US.UTF-8 is only needed if your default locale is not UTF-8 (which it should be if you work with UTF-8 texts) :)
Demo:
str="X             　X"
echo "${str//[$spaces]/_}"
prints again:
X_____________X
With sed, prepare the variable $spaces as above and use:
sed "s/[$spaces]/ /g" file
Edit - due to some weird copy / paste issues (or locales):
xxd -ps <<<"$spaces"
shows
c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e2
8087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a
and the md5 digest (computed with two different programs)
md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"
prints the same md5
35cf5e1d7a5f512031d18f3d2ec6612f -
35cf5e1d7a5f512031d18f3d2ec6612f
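If the copy/paste worries apply to you, the byte sequence can also be rebuilt independently of the shell. The following Python sketch assumes the same list of code points as the `printf` call above and adds the trailing newline that a here-string (`<<<`) appends. It prints the UTF-8 hex dump and the md5 digest, so both can be compared with the `xxd` and `md5sum` output shown above.

```python
import hashlib

# Same code points as in the printf call: U+00A0, U+1680, U+180E,
# U+2000..U+200B, U+202F, U+205F, U+3000, U+FEFF.
CODE_POINTS = [
    0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200C),
    0x202F, 0x205F, 0x3000, 0xFEFF,
]

spaces = "".join(chr(cp) for cp in CODE_POINTS)

# A bash here-string appends a newline, hence the trailing "\n".
data = (spaces + "\n").encode("utf-8")

print(data.hex())                     # compare with the xxd -ps output
print(hashlib.md5(data).hexdigest())  # compare with the md5 digest
```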
You can identify the characters by their Unicode code points, since sed 's/[[:space:]]\+/\ /g'
is unlikely to do the trick.
Adapting another SO answer, we list all the code points in a variable, then use sed to do the replacement (note that with -i.bak
a copy of the original file is kept as well):
CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt
If you run into this task repeatedly, consider installing nws (normalize whitespace), a utility (mine) that makes things easier:
nws --ascii file # convert non-ASCII whitespace and punctuation to ASCII
nws --ascii -i file # update file in place
The --ascii mode of nws:
- transliterates (non-ASCII) Unicode whitespace (e.g., a no-break space) and punctuation (e.g., curly quotes (“”), en dash (–), ...) to their closest ASCII equivalent
- while leaving any other Unicode characters alone.
This mode is useful for source code samples that have been formatted for display with typographical quotes, em dashes, etc., which usually makes the code hard to digest for compilers / interpreters.
Installing nws from the npm registry (Linux and macOS)
Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try curl -L https://git.io/n-install | bash
With Node.js installed, install as follows:
[sudo] npm install nws-cli -g
Note
- Whether you need sudo depends on how you installed Node.js and whether you later changed permissions; if you get an EACCES error, try again with sudo.
- The -g ensures a global installation and is needed to put nws in your system's $PATH.
Manual installation (any Unix platform with bash)
- Download this bash script as nws.
- Make it executable with chmod +x nws.
- Move it or symlink it to a folder in your $PATH, e.g. /usr/local/bin (macOS) or /usr/bin (Linux).
Additional reading: POSIX character classes [:space:] and [:blank:] and non-ASCII Unicode whitespace
In UTF-8-based locales, POSIX-compliant utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.
This depends on the locale's charmap correctly classifying Unicode characters according to the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:] that are available in patterns and regular expressions.
There are two pitfalls:
- Unicode is an evolving standard (version 9 at the time of this writing); your platform's UTF-8 charmap may not be current.
  - For example, on Ubuntu 16.04 the following characters are not properly classified and therefore not matched by [:space:] / [:blank:]: no-break space, figure space, narrow no-break space, next line
- Utilities should respect the active locale's charmap, but there are regrettable exceptions; the following utilities are NOT Unicode-aware (there may be more):
  - Among the GNU utilities (as of coreutils v8.27): cut, tr
  - Mawk, the default awk implementation on Ubuntu.
  - Among BSD / macOS utilities (as of macOS 10.12): awk
So, on a platform with a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tabs and therefore replaces them with a single space as well:
sed 's/[[:space:]]/ /g' file
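To get a feel for how classification decides what counts as whitespace, the sketch below inspects a few code points with Python's bundled Unicode database. This is an assumption-laden analogy: Python uses its own data (`unicodedata`), not the platform's locale charmap, so its answers may differ from what sed or grep see on the same machine. The helper name `describe` and the list `CANDIDATES` are made up for illustration.

```python
import unicodedata

# A few code points mentioned in this thread, plus TAB and the ASCII space.
CANDIDATES = [0x0009, 0x0020, 0x0085, 0x00A0, 0x2007, 0x202F, 0x200B, 0xFEFF]


def describe(cp):
    """Return code point, general category, isspace() result and name."""
    ch = chr(cp)
    # Control characters have no name, so supply a default.
    name = unicodedata.name(ch, "<unnamed control>")
    return f"U+{cp:04X} {unicodedata.category(ch)} isspace={ch.isspace()} {name}"


for cp in CANDIDATES:
    print(describe(cp))
```

In this database the no-break space, figure space and narrow no-break space are in category Zs and count as whitespace, while ZERO WIDTH SPACE (U+200B) and the BOM (U+FEFF) are format characters and do not. That is why the explicit lists earlier in this thread name those two separately.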
If you are using Python 3, this worked for me; it's throwaway code, but it works.
FILENAME = 'File.txt'
OUTPUTNAME = 'Fixed.txt'

# Read and write as UTF-8; the with-statement closes both files.
with open(FILENAME, 'r', encoding='utf8') as f, \
        open(OUTPUTNAME, 'w', encoding='utf8') as o:
    for line in f:
        for ch in line:
            # Only EM SPACE (U+2003) is replaced here.
            if ch == '\u2003':
                ch = ' '
            o.write(ch)
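The script above only handles EM SPACE (U+2003). A possible generalization, sketched here under the assumption that you want the same code points as the bash and sed answers in this thread, builds a translation table once and applies it with `str.translate`. The names `SPACE_CODE_POINTS`, `fix_line` and `fix_file` are made up for illustration.

```python
# Code points taken from the printf lists used in the other answers:
# U+00A0, U+1680, U+180E, U+2000..U+200B, U+202F, U+205F, U+3000, U+FEFF.
SPACE_CODE_POINTS = [
    0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200C),
    0x202F, 0x205F, 0x3000, 0xFEFF,
]

# str.translate expects a mapping from code point to replacement.
TABLE = {cp: " " for cp in SPACE_CODE_POINTS}


def fix_line(line):
    """Replace every listed character with a regular space."""
    return line.translate(TABLE)


def fix_file(src, dst):
    """Copy src to dst, replacing the listed characters line by line."""
    with open(src, encoding="utf8") as fin, \
            open(dst, "w", encoding="utf8") as fout:
        for line in fin:
            fout.write(fix_line(line))
```

Calling `fix_file('File.txt', 'Fixed.txt')` would mirror the original script. Tabs and newlines are not in the table, so they pass through unchanged.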