Grep and tail -f for UTF-16 binary - trying to use plain awk

How can I achieve the equivalent of:

tail -f file.txt | grep 'regexp'

      

to output, line by line as they arrive, only the lines matching a regex such as 'Result'

from a file of this type:

$ file file.txt
file.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators

      

Example content of the tail -f stream (converted to UTF-8 below):

Package end.

Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.

Result: Success

      

Awk?

Problems piping into grep led me to awk as a one-stop-shop solution for stripping the offending characters as well as matching a regex against consistent strings.

awk seems to give the most promising results, however I find that it returns the entire stream, not individual matching lines:

tail -f file.txt | awk '{sub("/[^\x20-\x7F]/", "");/Result/;print}'
Package end.

Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.

Result: Success

      

What I tried

  • converting the stream and piping it to grep

    tail -f file.txt | iconv -t UTF-8 | grep 'regexp'
    
          

  • using luit to change the encoding of the terminal, according to this post

    luit -encoding UTF-8 -- tail -f file.txt | grep 'regexp'
    
          

  • removing the non-ASCII characters as described here, then piping to grep

    tail -f file.txt | tr -d '[^\x20-\x7F]' | grep 'regexp'
    tail -f file.txt | sed 's/[^\x00-\x7F]//' | grep 'regexp'
    
          

  • various combinations of the above using the grep flags --line-buffered and -a, as well as sed -u
  • prepending luit -encoding UTF-8 -- to the previous commands
  • using a file with the same encoding that contains the regex, for grep -f

Why they failed

  • In most attempts nothing is printed at all, because grep searches for 'regexp' when the text actually looks like '\x00r\x00e\x00g\x00e\x00x\x00p'. For example, 'R' returns the line 'Result: Success', but 'Result' won't.
  • If the full regex does get a match, for example with grep -f, the entire stream is returned rather than only the matching lines.
  • Piping through sed, tr or iconv seems to break the pipe into grep, and grep still only manages to match individual characters.
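
The mismatch described in the first point can be reproduced on a small generated sample. This is only a sketch: the sample text is made up to resemble the log above, and it assumes GNU grep plus an iconv that knows UTF-16LE.

```shell
# Build a small UTF-16LE sample with CRLF line endings (assumed content).
sample=$(mktemp)
printf 'Total errors: 0\r\nResult: Success\r\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Byte view: every ASCII character is followed by a NUL byte.
od -c "$sample" | head -n 3

# A single character is found in the raw bytes...
single_hits=$(LC_ALL=C grep -a -c 'R' "$sample" || true)
# ...but the whole word is not, because NUL bytes sit between its letters.
word_hits=$(LC_ALL=C grep -a -c 'Result' "$sample" || true)

echo "lines matching 'R': $single_hits, lines matching 'Result': $word_hits"
rm -f "$sample"
```

On this sample the single letter matches one line and the full word matches none, which is the behaviour reported above.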

Edit

I dumped the raw UTF-16 file with xxd, with the aim of writing a regex that matches the encoding, which gave the following output:

$ tail file.txt | xxd
00000000: 0050 0061 0063 006b 0061 0067 0065 0020  .P.a.c.k.a.g.e.
00000010: 0065 006e 0064 002e 000d 000a 000d 000a  .e.n.d..........
00000020: 0054 006f 0074 0061 006c 0020 0077 0061  .T.o.t.a.l. .w.a
00000030: 0072 006e 0069 006e 0067 0073 003a 0020  .r.n.i.n.g.s.:.
00000040: 0034 0030 000d 000a 0054 006f 0074 0061  .4.0.....T.o.t.a
00000050: 006c 0020 0065 0072 0072 006f 0072 0073  .l. .e.r.r.o.r.s
00000060: 003a 0020 0030 000d 000a 0045 006c 0061  .:. .0.....E.l.a
00000070: 0070 0073 0065 0064 0020 0074 0069 006d  .p.s.e.d. .t.i.m
00000080: 0065 003a 0020 0032 0034 002e 0034 0032  .e.:. .2.4...4.2
00000090: 0036 0037 0031 0039 0032 0020 0073 0065  .6.7.1.9.2. .s.e
000000a0: 0063 0073 002e 000d 000a 002e 002e 002e  .c.s............
000000b0: 0050 0061 0063 006b 0061 0067 0065 0020  .P.a.c.k.a.g.e.
000000c0: 0045 0078 0065 0063 0075 0074 0065 0064  .E.x.e.c.u.t.e.d
000000d0: 002e 000d 000a 000d 000a 0052 0065 0073  ...........R.e.s
000000e0: 0075 006c 0074 003a 0020 0053 0075 0063  .u.l.t.:. .S.u.c
000000f0: 0063 0065 0073 0073 000d 000a 000d 000a  .c.e.s.s........
00000100: 00

      



2 answers


I figured out that a simple regex that ignores whatever character sits between the letters of the search string might work.

This matches 'Result', allowing any character in between each letter:

$ tail -f file.txt | grep -a 'R.e.s.u.l.t'
Result: Success

$ tail -f file.txt | awk '/R.e.s.u.l.t./'
Result: Success

      



Or, according to this answer, to avoid typing all the tedious dots:

search="Result"
tail -f file.txt | grep -a -e "$(echo "$search" | sed 's/./&./g')"
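
To see what the sed expression produces, it can be run on its own. The sketch below wraps it in a helper; the name interleave_dots is hypothetical and not part of the answer above. Note that regex metacharacters in the search string are not escaped, so this is only suitable for plain literal words.

```shell
# Hypothetical helper: put a "." after every character of a literal search
# string, so that each character may be followed by one arbitrary byte.
interleave_dots() {
    printf '%s\n' "$1" | sed 's/./&./g'
}

pattern=$(interleave_dots 'Result')
echo "$pattern"    # prints R.e.s.u.l.t.

# Usage against the live file (file.txt as in the question):
#   tail -f file.txt | grep -a -e "$pattern"
```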

      



The sloppiest solution, which should work on Cygwin, is to fix your awk statement:

tail -f file.txt | \
    LC_CTYPE=C awk '{ gsub("[^[:print:]]", ""); if($0 ~ /Result/) print; }'

      

It has a few bugs that cancel each other out, like tail cutting the UTF-16LE file in awkward places, but awk removes what we hope is garbage.
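
The "awkward places" can be seen on a two-line sample: tail splits on the 0a byte, but in UTF-16LE a newline is the byte pair 0a 00, so the trailing 00 lands at the start of the next line. A minimal sketch, assuming GNU coreutils and an iconv that knows UTF-16LE; the sample text is made up:

```shell
# "a" and "b" on separate lines, encoded as UTF-16LE:
#   61 00 0a 00 62 00 0a 00
# tail counts lines by the 0a byte only, so the last two "lines" it sees are
#   00 62 00 0a   and a lone trailing   00
tail_bytes=$(printf 'a\nb\n' \
    | iconv -f UTF-8 -t UTF-16LE \
    | tail -n 2 \
    | od -An -tx1 \
    | tr -d ' \n')

echo "$tail_bytes"   # prints 0062000a00
```

This is the same one-byte shift visible in the xxd output of the question, where each character shows up as 00 xx and the dump ends with a lone 00 although the file is little-endian.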



A reliable solution might be:

tail -c +1 -f file.txt | \
    script -qc 'iconv -f UTF-16LE -t UTF-8' /dev/null | grep Result

      

but it reads the whole file, and I don't know how well using script to convince iconv not to buffer works on Cygwin (it will work on GNU/Linux).
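
The decoding step can be checked on finite input, without tail -f or script. The sketch below is an illustration under assumptions: the helper name utf16le_grep and the sample text are hypothetical, and the extra tr call simply drops the carriage returns left over from the CRLF line terminators.

```shell
# Hypothetical helper: decode a UTF-16LE stream, drop the CRs from the CRLF
# line endings, and keep only the lines matching the given pattern.
utf16le_grep() {
    iconv -f UTF-16LE -t UTF-8 | tr -d '\r' | grep -e "$1"
}

# Generate a UTF-16LE sample and filter it.
matches=$(printf 'Total errors: 0\r\n\r\nResult: Success\r\n' \
    | iconv -f UTF-8 -t UTF-16LE \
    | utf16le_grep 'Result')

echo "$matches"   # prints Result: Success
```

Because the input here ends, buffering does not matter. On a stream that never ends, output may be held back until a buffer fills, which is the problem the script wrapper above is meant to work around.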
