Bash - Remove all Unicode spaces and replace with normal space

I have a file with a lot of text that contains special space characters, i.e. Unicode spaces.

I need to replace all of them with the usual space character.

+3




4 answers


A simple approach using perl:

perl -CSDA -plE 's/\s/ /g' file

      

but as @mklement0 points out in the comments, it also matches \t (TAB). If this is a problem, you can use

perl -CSDA -plE 's/[^\S\t]/ /g'

      

Demo:

X            　X

      

The string above contains:

U+00058 X LATIN CAPITAL LETTER X
U+01680   OGHAM SPACE MARK
U+02002   EN SPACE
U+02003   EM SPACE
U+02004   THREE-PER-EM SPACE
U+02005   FOUR-PER-EM SPACE
U+02006   SIX-PER-EM SPACE
U+02007   FIGURE SPACE
U+02008   PUNCTUATION SPACE
U+02009   THIN SPACE
U+0200A   HAIR SPACE
U+0202F   NARROW NO-BREAK SPACE
U+0205F   MEDIUM MATHEMATICAL SPACE
U+03000 　 IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X

      

Running it through:

perl -CSDA -plE 's/\s/_/g'  <<<"X            　X"

      

(note that the demo replaces with an underscore so the result is visible) prints

X_____________X
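For comparison, the same character-class idea can be sketched in Python, where `\s` in a `str` pattern also covers Unicode whitespace. This is only an illustrative equivalent and not part of the perl answer; the function name `normalize_spaces` and the decision to preserve newlines along with tabs are assumptions.

```python
import re

# Any whitespace character except TAB and newline, mirroring the perl
# class [^\S\t]. Newline is excluded as well because, unlike perl -l,
# nothing strips the line ending before the substitution here.
_UNICODE_SPACE = re.compile(r"[^\S\t\n]")


def normalize_spaces(text: str, replacement: str = " ") -> str:
    """Replace every matched whitespace character with `replacement`."""
    return _UNICODE_SPACE.sub(replacement, text)


# The demo string from above, written with escapes so it survives copy/paste.
demo = (
    "X\u1680\u2002\u2003\u2004\u2005\u2006\u2007"
    "\u2008\u2009\u200a\u202f\u205f\u3000X"
)
print(normalize_spaces(demo, "_"))
```

With the underscore replacement this should give the same `X_____________X` line as the perl demo, while tabs pass through unchanged.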

      

This can also be done in pure bash:

LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

# IFS= keeps leading/trailing whitespace; printf avoids echo's option parsing
while IFS= read -r line; do
    printf '%s\n' "${line//[$spaces]/ }"
done < file

      



LC_ALL=en_US.UTF-8 is only needed if your default locale is not UTF-8 (which it should be if you work with UTF-8 text) :) Demo:

str="X            　X"
echo "${str//[$spaces]/_}"

      

prints again:

X_____________X
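The explicit-list approach carries over to other languages as well. Below is a minimal Python sketch using `str.translate`, with the code points copied from the `$spaces` variable above; the names `SPACE_CODE_POINTS` and `replace_listed_spaces` are made up for illustration.

```python
# Code points copied from the $spaces list above.
SPACE_CODE_POINTS = [
    0x00A0, 0x1680, 0x180E, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004,
    0x2005, 0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x200B, 0x202F,
    0x205F, 0x3000, 0xFEFF,
]


def replace_listed_spaces(text: str, replacement: str = " ") -> str:
    """Map each listed code point to `replacement`; leave the rest alone."""
    table = {cp: replacement for cp in SPACE_CODE_POINTS}
    return text.translate(table)


demo = (
    "X\u1680\u2002\u2003\u2004\u2005\u2006\u2007"
    "\u2008\u2009\u200a\u202f\u205f\u3000X"
)
print(replace_listed_spaces(demo, "_"))
```

Unlike a `\s`-based pattern, this only touches the characters in the list, so tabs and newlines are never affected, and zero-width characters such as U+200B are handled because they are listed explicitly.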

      

With sed, prepare the variable $spaces as above and use:

sed "s/[$spaces]/ /g" file

      

Edit: because of some weird copy/paste issues (or locale differences), here is a way to verify the contents of $spaces:

xxd -ps <<<"$spaces"

      

shows

c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e2
8087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a

      

The MD5 digest (computed with two different programs)

md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"

      

prints the same MD5 both times:

35cf5e1d7a5f512031d18f3d2ec6612f  -
35cf5e1d7a5f512031d18f3d2ec6612f
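If xxd or md5sum are not at hand, the same sanity check can be sketched in Python. This is an assumption-laden equivalent rather than part of the original answer: it rebuilds the string from escapes, appends the newline that the bash here-string (`<<<`) adds, and prints the UTF-8 bytes as hex along with an MD5 digest.

```python
import hashlib

# Same code points as the $spaces variable above, written as escapes.
spaces = (
    "\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006"
    "\u2007\u2008\u2009\u200a\u200b\u202f\u205f\u3000\ufeff"
)

# The here-string (<<<) appends a newline, hence the trailing 0a byte.
data = (spaces + "\n").encode("utf-8")

print(data.hex())                      # compare with the xxd -ps dump
print(hashlib.md5(data).hexdigest())   # compare with the md5sum output
```

If the hex line differs from the xxd dump, the string was probably altered somewhere along the way, for example by an editor or a copy/paste step.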

      

+3


source


You can identify the characters by their Unicode code points, since sed 's/[[:space:]]\+/\ /g' is unlikely to do the trick.

Adapting another SO answer, we list all the code points, store them in a variable, and then use sed to do the replacement (note that with -i.bak a copy of the original file is kept as well):



 CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

 sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt 

      

+1


source


If you run into this task repeatedly, consider installing nws (normalize whitespace), a utility (mine) that makes things easier:

nws --ascii file # convert non-ASCII whitespace and punctuation to ASCII

nws --ascii -i file  # update file in place

      

The --ascii mode of nws:

  • transliterates (non-ASCII) Unicode whitespace (e.g., a no-break space) and punctuation (e.g., curly quotes (“ ”), en dash (–), ...) to their closest ASCII equivalent
  • while leaving any other Unicode characters untouched.

This mode is useful for source code samples that have been formatted for display with typographical quotes, em dashes, etc., which usually makes the code hard to digest for compilers / interpreters.
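To make the idea of transliteration concrete, here is a small Python sketch. It is not how nws is implemented; the mapping is a hypothetical, deliberately tiny subset, and nws may cover more characters or pick different ASCII equivalents.

```python
# Hypothetical mapping for illustration only; not taken from nws.
ASCII_EQUIVALENTS = {
    "\u00a0": " ",   # NO-BREAK SPACE
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
}

_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def to_ascii_punctuation(text: str) -> str:
    """Transliterate the mapped characters; leave everything else untouched."""
    return text.translate(_TABLE)


sample = "print(\u201chello\u201d)\u00a0# caf\u00e9 \u2013 ok"
print(to_ascii_punctuation(sample))
```

The curly quotes, no-break space and en dash become plain ASCII, while the accented letter in the comment is left as it is, which matches the "leave other Unicode characters alone" behavior described above.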


Installing nws from the npm registry (Linux and macOS)

Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try
curl -L https://git.io/n-install | bash

With Node.js installed, install as follows:

[sudo] npm install nws-cli -g

      

Note:

  • Whether you need sudo depends on how you installed Node.js and whether you have changed permissions later; if you get an EACCES error, try again with sudo.
  • -g ensures a global installation and is needed to put nws in your system's $PATH.

Manual installation (any Unix platform with bash)

  • Download this bash script as nws.
  • Make it executable with chmod +x nws.
  • Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).

Additional reading: POSIX character classes [:space:] and [:blank:] and non-ASCII Unicode whitespace

In UTF-8-based locales, POSIX-compliant utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.

This relies on the locale's charmap correctly classifying Unicode characters according to the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:] available in patterns and regular expressions.

There are two pitfalls:

  • Unicode is an evolving standard (version 9 at the time of this writing); your platform's UTF-8 charmap may not be current.

    • For example, on Ubuntu 16.04 the following characters are not classified properly and therefore are not matched by [:space:] / [:blank:]:
      no-break space, figure space, narrow no-break space, next line
  • Utilities should respect the active locale's charmap, but there are regrettable exceptions; the following utilities are NOT Unicode-aware (there may be more):

    • Among the GNU utilities (as of coreutils v8.27):

      • cut, tr

    • mawk, the default awk implementation on Ubuntu.

    • Among BSD / macOS utilities (as of macOS 10.12):

      • awk

So, on a platform with a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tabs and therefore replaces each of them with a single space as well:

sed 's/[[:space:]]/ /g' file
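Since the pitfalls above come down to which characters a given runtime classifies as whitespace, it can help to have a reference list to compare against. The sketch below enumerates what Python's own bundled Unicode data treats as whitespace; note the assumption that this says nothing about your platform's libc locale data (which is what sed consults), so treat it as a point of comparison only. The function name `whitespace_code_points` is made up.

```python
import sys
import unicodedata


def whitespace_code_points():
    """Yield (code point, name) for every character str.isspace() accepts."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.isspace():
            # Control characters such as TAB have no Unicode name.
            yield cp, unicodedata.name(ch, "<control>")


for cp, name in whitespace_code_points():
    print(f"U+{cp:04X} {name}")
```

Characters that show up here but are not replaced by the sed command on your system are candidates for the explicit-list approach from the other answers.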

      

+1


source


If you are using python3, this worked for me; it's throwaway code, but it works.

FILENAME = 'File.txt'
OUTPUTNAME = 'Fixed.txt'

# Copy the file, replacing each EM SPACE (U+2003) with a plain space.
# Only U+2003 is handled here; other code points need their own replace.
with open(FILENAME, encoding='utf8') as f, \
        open(OUTPUTNAME, 'w', encoding='utf8') as o:
    for line in f:
        o.write(line.replace('\u2003', ' '))

      

0


source






