How do I split a UTF-8 string on an escape sequence provided as a command-line argument in Python 3?

I am trying to split UTF-8 strings on a delimiter provided as a command-line argument in Python 3. The TAB character "\t" must be a valid delimiter. Unfortunately, I haven't found a way to have the escape sequence interpreted as such. I wrote a small test script called "test.py":

# coding: utf8
import sys

print(sys.argv[1])

l1 = u"12345\tktktktk".split(sys.argv[1])
print(l1)

l2 = u"633\tbgt".split(sys.argv[1])
print(l2)

      

I tried to run this script as follows (in a Guake terminal on a Kubuntu Linux host):

  • python3 test.py \t
  • python3 test.py \\t
  • python3 test.py '\t'
  • python3 test.py "\t"

None of these attempts worked. I also tried this with a larger file containing "real" (and unfortunately sensitive) data where, for some strange reason, in many (but far from all) cases the lines were split correctly when using the first call.

What is the correct way to get Python 3 to interpret a command-line argument as an escape sequence rather than as a literal string?

2 answers


You can use ANSI-C quoting, $'...', in Bash:

python3 test.py $'\t'

      

From the Bash manual, section "ANSI-C Quoting":

Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:

\a      alert (bell)
\b      backspace
\e, \E  an escape character (not ANSI C)
\f      form feed
\n      newline
\r      carriage return
\t      horizontal tab  <-
...

      

Output:

$ python3 test.py $'\t'

['12345', 'ktktktk']
['633', 'bgt']
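For context on why ordinary quoting does not produce a tab, here is a minimal sketch that uses Python's shlex module to approximate POSIX shell word splitting and quote removal. The command lines and the helper name received_argument are illustrative assumptions, and shlex does not understand $'...', so this only models plain backslashes and quotes:

```python
import shlex

# Illustrative command lines using common quoting variants (assumed examples).
attempts = [
    r"python3 test.py \t",
    r"python3 test.py \\t",
    r"python3 test.py '\t'",
    r'python3 test.py "\t"',
]


def received_argument(command_line):
    """Return the last word after POSIX-style splitting and quote removal."""
    return shlex.split(command_line)[-1]


for command_line in attempts:
    # repr() makes the difference between a real tab and backslash + t visible.
    print("{:26} -> {!r}".format(command_line, received_argument(command_line)))
```

Under these rules the script receives either a bare t or the two characters backslash and t, never an actual tab character, which is why the shell or the script has to decode the escape explicitly.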

      



From the Bash Hackers Wiki:

This is especially useful if you want to supply special characters as arguments to some programs, such as a newline in sed.

The resulting text is treated as if it were single-quoted. No further expansion happens.

The $'...' syntax comes from ksh93, but is portable to most modern shells, including pdksh. A specification for it was accepted for SUS issue 7. There are still some stragglers, such as most ash variants, including dash (except busybox built with "bash compatibility" features).

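Or, in Python:

import sys

# Decode backslash escapes in the argument: the two characters "\t" become a tab.
# Note: this is only reliable when the argument is pure ASCII.
arg = bytes(sys.argv[1], "utf-8").decode("unicode_escape")

print(arg)

l1 = u"12345\tktktktk".split(arg)
print(l1)

l2 = u"633\tbgt".split(arg)
print(l2)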

      

Output:

$ python3 test.py '\t'

['12345', 'ktktktk']
['633', 'bgt']
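One caveat with the Python snippet above: the unicode_escape codec decodes bytes as Latin-1, so a delimiter that contains non-ASCII characters gets mangled when it is first encoded as UTF-8. A minimal sketch of a workaround follows; the helper name decode_escapes is hypothetical, and the approach assumes the argument contains only well-formed escape sequences:

```python
def decode_escapes(text):
    """Interpret backslash escapes in text while preserving non-ASCII characters."""
    # Characters outside Latin-1 are written as \uXXXX or \UXXXXXXXX escapes,
    # which the unicode_escape decoder turns back into the original characters.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


delimiter = "\u00e9\\t"  # e-acute followed by the two characters backslash and t

# The UTF-8 round trip corrupts the accented character.
naive = bytes(delimiter, "utf-8").decode("unicode_escape")
print(naive == "\u00e9\t")                      # False

# The Latin-1 round trip keeps it intact and still decodes the tab.
print(decode_escapes(delimiter) == "\u00e9\t")  # True
print("12345\tktktktk".split(decode_escapes("\\t")))
```

For plain ASCII delimiters such as '\t' both variants behave the same, so the difference only matters when the delimiter itself may contain accented or other non-ASCII characters.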

      



At least in Bash on Linux, you need to press Ctrl+V followed by Tab to type a literal tab character:

Example:

python utfsplit.py '<Ctrl+V><Tab>'

      



Otherwise your code works:

$ python3.4 utfsplit.py '       '

['12345', 'ktktktk']
['633', 'bgt']

      

NB: The tab characters don't actually show up here :)
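Because a literal tab is invisible in the terminal, it can help to print the argument with repr() to confirm what the script actually received. A small sketch, where the helper name show_argument is hypothetical:

```python
import sys


def show_argument(arg):
    """Return a readable description of a delimiter argument."""
    code_points = [hex(ord(ch)) for ch in arg]
    return "{!r} (code points: {})".format(arg, code_points)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(show_argument(sys.argv[1]))
```

A real tab shows up as '\t' with code point 0x9, while the two literal characters backslash and t show up as '\\t' with code points 0x5c and 0x74.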
