Bash - Remove all Unicode spaces and replace with normal space

Question

I have a file with a lot of text that contains special space characters, namely Unicode spaces.

I need to replace all of them with the usual space character.

+3

bash unicode sed spaces

Kuzeko Apr 26. 17 at 15:49

4 answers

If the characters have to be identified by their Unicode code points, sed 's/[[:space:]]\+/\ /g' is unlikely to do the trick.

Adapting another SO answer, we list all the Unicode space characters, store them in a variable, and then use sed to do the replacement (note that with -i.bak a copy of the original file is kept as well).

 CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

 sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt

+1

Kuzeko Apr 26. 17 at 15:49

If you run into this task repeatedly, consider installing nws (normalize whitespace), a utility (mine) that makes things easier:

nws --ascii file # convert non-ASCII whitespace and punctuation to ASCII

nws --ascii -i file  # update file in place

The --ascii mode of nws transliterates (non-ASCII) Unicode whitespace (e.g. the no-break space) and punctuation (e.g. curly quotes (“ ”), en dash (–), ...) to their closest ASCII equivalents, leaving any other Unicode characters untouched.

This mode is useful for source code samples that have been formatted for display with typographical quotes, em dashes, etc., which usually makes the code hard to digest for compilers / interpreters.

Installing `nws` from the npm registry (Linux and macOS)

Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try
curl -L https://git.io/n-install | bash

With Node.js installed, install as follows:

[sudo] npm install nws-cli -g

Note:
- Whether you need sudo depends on how you installed Node.js and whether you have changed permissions later; if you get an EACCES error, try again with sudo.
- -g ensures a global installation and is needed to put nws in your system's $PATH.

Manual installation (any Unix platform with `bash`)

- Download this bash script as nws.
- Make it executable with chmod +x nws.
- Move it or symbolically link it to a folder in $PATH, e.g. /usr/local/bin (macOS) or /usr/bin (Linux).

Additional reading: POSIX character classes `[:space:]` and `[:blank:]` and non-ASCII Unicode whitespace

In UTF-8-based locales, POSIX-compliant utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.

This depends on Unicode characters being classified correctly according to the POSIX-mandated character classifications, which map directly onto character classes such as [:space:] that are available in patterns and regular expressions.

There are two pitfalls:

- Unicode is an evolving standard (version 9 at the time of this writing); your platform's UTF-8 charmap may not be current.
  - For example, on Ubuntu 16.04 the following characters are not classified properly and therefore are not matched by [:space:] / [:blank:]: no-break space, figure space, narrow no-break space, next line.
- Utilities should use the active locale's charmap, but there are regrettable exceptions; the following utilities are NOT Unicode-aware (there may be more):
  - Among the GNU utilities (as of coreutils v8.27): cut, tr
  - mawk, the default awk implementation on Ubuntu.
  - Among BSD / macOS utilities (as of macOS 10.12): awk

So, on a platform with a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tabs and therefore replaces them with a single space as well:

sed 's/[[:space:]]/ /g' file
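
For comparison, here is a minimal Python 3 sketch of the same idea that leaves tabs and line endings alone. It relies on Python's own Unicode database rather than the platform's charmap, so the set of characters it matches may differ from what sed's [:space:] matches on a given system. The names normalize_spaces and _SPACE_NOT_TAB are made up for this illustration.

```python
import re

# Any Unicode whitespace character except TAB, CR and LF.
# In Python 3, \s in a str pattern is Unicode-aware by default.
_SPACE_NOT_TAB = re.compile(r"[^\S\t\r\n]")


def normalize_spaces(text):
    """Replace each Unicode whitespace char (except tab/CR/LF) with ' '."""
    return _SPACE_NOT_TAB.sub(" ", text)


if __name__ == "__main__":
    sample = "a\u00a0b\u2003c\td\u3000e\n"
    print(repr(normalize_spaces(sample)))  # prints 'a b c\td e\n'
```

Characters that Python does not classify as whitespace, such as U+200B (ZERO WIDTH SPACE), are left untouched by this pattern.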

+1

mklement0 Apr 28 17 at 21:57

If you are using python3, this worked for me; it's throwaway code, but it works.

FILENAME = 'File.txt'
OUTPUTNAME = 'Fixed.txt'

# Read the input as UTF-8 and write a copy in which every U+2003 (EM SPACE)
# is replaced by an ordinary space. Only this one code point is handled.
with open(FILENAME, 'r', encoding='utf8') as f, \
        open(OUTPUTNAME, 'w', encoding='utf8') as o:
    for line in f:
        o.write(line.replace('\u2003', ' '))
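
The script above handles only U+2003. As a rough sketch, it could be extended to the same list of code points that the sed and bash answers on this page store in $CHARS / $spaces, using str.translate. The function and variable names here are hypothetical, and like those answers it also turns the zero-width characters U+200B and U+FEFF into spaces, which may or may not be what you want.

```python
# Code points copied from the $CHARS / $spaces lists in the other answers.
SPACE_CODE_POINTS = [
    0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200C),  # U+2000 .. U+200B
    0x202F, 0x205F, 0x3000, 0xFEFF,
]

# str.translate accepts a dict mapping code points to replacement strings.
TABLE = {cp: " " for cp in SPACE_CODE_POINTS}


def replace_unicode_spaces(text):
    """Return text with every listed code point replaced by a plain space."""
    return text.translate(TABLE)


def convert_file(src, dst):
    """Copy src to dst (both UTF-8), replacing the listed characters."""
    # newline="" keeps the original line endings unchanged.
    with open(src, encoding="utf-8", newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(replace_unicode_spaces(line))
```

Calling convert_file('File.txt', 'Fixed.txt') would then mirror the file names used in the original script.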

0

Russian Weeaboosky Dec 29. 17 at 21:06

jm666 · Accepted Answer · 2017-04-26T17:02:21+0000

This is easy with perl:

perl -CSDA -plE 's/\s/ /g' file

but, as @mklement0 pointed out in the comments, it will also match \t (TAB). If this is a problem, you can use

perl -CSDA -plE 's/[^\S\t]/ /g'

Demo:

X            　X

The line above contains:

U+00058 X LATIN CAPITAL LETTER X
U+01680   OGHAM SPACE MARK
U+02002   EN SPACE
U+02003   EM SPACE
U+02004   THREE-PER-EM SPACE
U+02005   FOUR-PER-EM SPACE
U+02006   SIX-PER-EM SPACE
U+02007   FIGURE SPACE
U+02008   PUNCTUATION SPACE
U+02009   THIN SPACE
U+0200A   HAIR SPACE
U+0202F   NARROW NO-BREAK SPACE
U+0205F   MEDIUM MATHEMATICAL SPACE
U+03000 　 IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X

Running it through:

perl -CSDA -plE 's/\s/_/g'  <<<"X            　X"

(note that, for the demo, the replacement is an underscore) prints

X_____________X

This can also be done with pure bash:

LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

# Reads stdin line by line; IFS= keeps leading and trailing whitespace intact.
while IFS= read -r line; do
    printf '%s\n' "${line//[$spaces]/ }"
done

LC_ALL=en_US.UTF-8 is only needed if your default locale is not UTF-8 (which it should be if you work with UTF-8 texts) :)

Demo:

str="X            　X"
echo "${str//[$spaces]/_}"

prints again:

X_____________X

With sed, prepare the variable $spaces as above and use:

sed "s/[$spaces]/ /g" file

Edit - due to some weird copy / paste issues (or locales):

xxd -ps <<<"$spaces"

shows

c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e2
8087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a

The md5 digest (computed with two different programs)

md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"

prints the same md5

35cf5e1d7a5f512031d18f3d2ec6612f  -
35cf5e1d7a5f512031d18f3d2ec6612f
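
If you want to check which characters the hex dump above actually encodes, a small Python 3 sketch like the following decodes it back into code points. HEX_DUMP is copied from the xxd output shown above; the function name code_points is made up for this example.

```python
# The two lines of xxd -ps output shown above, joined together.
HEX_DUMP = (
    "c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e2"
    "8087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a"
)


def code_points(hex_dump):
    """Decode a hex dump of UTF-8 bytes into a list of 'U+XXXX' strings."""
    text = bytes.fromhex(hex_dump).decode("utf-8")
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    for cp in code_points(HEX_DUMP):
        print(cp)
```

The final entry, U+000A, is the newline that the bash here-string (<<<) appends; the 19 entries before it are the contents of $spaces.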
