Grep and give informative output

I want to see how many times each word from a list occurs in a file, and in how many lines.

My dummy examples look like this:

cat words
blue
red 
green
yellow 

cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT


I'm doing it like this:

for i in $(cat words); do grep "$i" text | wc >> output; done

cat output
  2       2      51
  0       0       0
  1       1      26
  0       0       0


But I really want to get:
1. The word that was used as the variable. 2. How many times the word was found in total. 3. How many lines the word was found on.

The preferred output looks like this:

blue    3   2
red     0   0 
green   1   1
yellow  0   0


$1 - the variable that was grep'ed
$2 - how many times the variable was found in the text
$3 - the number of lines it was found on

Hopefully someone can help me do this with grep, awk, or sed, as they are fast enough for a large dataset, but a Perl one-liner would also help.

Edit

Tried this

   for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*  


and it looks nice, but some words are longer than 300 characters, so I can't create a file named after the word.



5 answers


You can use grep's -o option, which prints only the matched parts of matching lines, with each match on a separate output line.
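For example, against the question's text file (recreated below), -o emits one line per occurrence, which wc -l can then count:

```shell
# Recreate the question's sample file
printf 'TEXTTEXTblueTEXTTEXTblue\nTEXTTEXTgreenblueTEXTTEXT\nTEXTTEXyeowTTEXTTEXTTEXT\n' > text

grep -o blue text          # prints "blue" three times: two from line 1, one from line 2
grep -o blue text | wc -l  # 3
```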

while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo $line $wordcount $linecount
done < words | column -t


You can put everything on one line to make it a one-liner.
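For instance, the whole loop collapsed onto one line might look like this (a sketch, reusing the question's words and text files, recreated below):

```shell
# Recreate the question's sample files
printf 'blue\nred\ngreen\nyellow\n' > words
printf 'TEXTTEXTblueTEXTTEXTblue\nTEXTTEXTgreenblueTEXTTEXT\nTEXTTEXyeowTTEXTTEXTTEXT\n' > text

# One line: word, total match count, matching line count
while IFS= read -r line; do echo "$line" $(grep -o "$line" text | wc -l) $(grep -c "$line" text); done < words | column -t
```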



If column gives a "line too long" error, you can use printf instead, as long as you know the maximum word length. Use the following instead of echo and remove the pipe to column:

printf "%-20s %-2s %-2s\n" "$line" $wordcount $linecount


Replace 20 with your maximum word length and adjust the other widths as needed.
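A quick sketch with an assumed maximum width of 20: the word column is padded to a fixed 20 characters so the columns line up regardless of word length.

```shell
# %-20s left-justifies the word in a 20-character field;
# %-2s does the same for the two counts
printf "%-20s %-2s %-2s\n" blue 3 2
printf "%-20s %-2s %-2s\n" yellow 0 0
```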



Here's a Perl solution, written as a complete script rather than a one-liner.

#!/usr/bin/perl

use 5.012;

die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless @ARGV;

my $wordsfile = shift @ARGV;
my @wordlist = do {
    open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
    map {chomp; length() ? $_ : ()} <$words_fh>;
};

my %words;
while (<>) {
    for my $word (@wordlist) {
        my $cnt = 0;
        $cnt++ for /\Q$word\E/g;
        $words{$word}[0] += $cnt;
        $words{$word}[1] += 1&!! $cnt; # trick to force 1 or 0.
    }
}

# sorts output by frequency; remove `sort {...}` to get unsorted output.
for my $key (sort {$words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b} keys %words) {
    say join "\t", $key, @{ $words{$key} };
}


Output example:



blue    3       2
green   1       1
red     0       0
yellow  0       0


Advantage over the bash script: each file is read only once.



This gets pretty ugly as a Perl one-liner (partly because it needs to read two files and only one can be fed via stdin, partly because of the requirement to count both the number of matching lines and the total number of matches), but here you go:

perl -E 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for @w' < text


This requires Perl 5.10 or newer, but changing it to support 5.8 and earlier is trivial: change -E to -e, change say to print, and add \n to the end of each line of output.

Output:

blue    3   2
red     0   0
green   1   1
yellow  0   0




An awk (gawk) one-liner can save you from the grep puzzle:

  awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text


Formatted a bit:

awk 'NR==FNR{n[$0];l[$0];next;}
    {for(w in n){ s=$0;
        t=gsub(w,"#",s); 
        n[w]+=t;l[w]+=t>0?1:0;}
    }END{for(x in n)print x,n[x],l[x]}' words text


Checking against your example:

kent$  awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow  0 0
red  0 0
green 1 1
blue 3 2


If you want to format the output, you can simply pipe the awk output to column -t, so it looks like this:

yellow  0  0
red     0  0
green   1  1
blue    3  2




awk '
NR==FNR { words[$0]; next }
{
   for (word in words) {
      count = gsub(word,word)
      if (count) {
         counts[word] += count
         lines[word]++
      }
   }
}
END { for (word in words) printf "%s %d %d\n", word, counts[word], lines[word] }
' words text








