Grep and give informative output
I want to see how many times each word from a list is mentioned in a file, and on how many lines.
My dummy examples look like this:
cat words
blue
red
green
yellow
cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
This is what I'm doing:
for i in $(cat words); do grep "$i" text | wc >> output; done
cat output
2 2 51
0 0 0
1 1 26
0 0 0
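For reference, `wc` with no flags prints three numbers per input — lines, words, and characters — which is why each row above has three columns but no word name. A minimal check of the first row:

```shell
# Two lines contain "blue"; each line is one "word" to wc,
# and the two lines total 51 characters including newlines.
printf 'TEXTTEXTblueTEXTTEXTblue\nTEXTTEXTgreenblueTEXTTEXT\n' | grep blue | wc
```

This reproduces the `2 2 51` row from the output above.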
But what I really want to get is:
1. The word that was used as the variable.
2. How many times the word was found, and on how many lines (each line counted once).
The preferred output looks like this:
blue 3 2
red 0 0
green 1 1
yellow 0 0
$1 - the word that was grep'ed
$2 - how many times the word was found in the text
$3 - the number of lines on which it was found
Hopefully someone can help me do this with grep, awk, or sed, as they are fast enough for a large dataset, but a Perl one-liner would also help.
Edit
Tried this
for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*
and it looks nice, but some words are longer than 300 characters, so I can't create a file named after each word.
You can use grep's -o option, which prints only the matched parts of matching lines, with each match on a separate output line.
while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo "$line" "$wordcount" "$linecount"
done < words | column -t
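A quick way to try it: recreate the sample files from the question with here-docs and run the loop (a sketch; it assumes grep supports -o and that column is available):

```shell
# Recreate the sample files from the question.
cat > words <<'EOF'
blue
red
green
yellow
EOF

cat > text <<'EOF'
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
EOF

# -o counts every match; -c counts each matching line once.
while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo "$line" "$wordcount" "$linecount"
done < words | column -t
```

This should print the rows blue 3 2, red 0 0, green 1 1, yellow 0 0, aligned by column -t.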
You can put everything on one line to make it a one-liner.
If column gives a "line too long" error, you can use printf instead, as long as you know the maximum number of characters. Use the printf below instead of echo, and remove the pipe to column:
printf "%-20s %-2s %-2s\n" "$line" $wordcount $linecount
Replace 20 with your maximum word length, and adjust the other numbers as needed.
Here's a similar Perl solution, but written as a complete script.
#!/usr/bin/perl
use 5.012;
die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless @ARGV;
my $wordsfile = shift @ARGV;
my @wordlist = do {
    open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
    map { chomp; length() ? $_ : () } <$words_fh>;
};
my %words;
while (<>) {
    for my $word (@wordlist) {
        my $cnt = 0;
        $cnt++ for /\Q$word\E/g;
        $words{$word}[0] += $cnt;
        $words{$word}[1] += 1 & !!$cnt;  # trick to force 1 or 0
    }
}
# Sorts output by frequency; remove `sort {...}` to get unsorted output.
for my $key (sort { $words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b } keys %words) {
    say join "\t", $key, @{ $words{$key} };
}
Output example:
blue 3 2
green 1 1
red 0 0
yellow 0 0
Advantage over the bash script: each file is read only once.
This gets pretty ugly as a Perl one-liner (partly because it needs to get data from two files and only one can be sent to stdin, partly because of the requirement to count both the number of lines matched and the total number of matches), but here you go:
perl -E 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for @w' < text
This requires perl 5.10 or newer, but changing it to support 5.8 and earlier is trivial: change -E to -e, change say to print, and add \n to the end of each line of output.
Output:
blue 3 2
red 0 0
green 1 1
yellow 0 0
An awk (gawk) one-liner can save you from the grep puzzle:
awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
Formatting the code a bit:
awk 'NR==FNR{n[$0];l[$0];next}
     {for(w in n){
         s=$0
         t=gsub(w,"#",s)
         n[w]+=t; l[w]+=(t>0?1:0)
     }}
     END{for(x in n)print x,n[x],l[x]}' words text
Checked with your example:
kent$ awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow 0 0
red 0 0
green 1 1
blue 3 2
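Note that `for (w in n)` iterates in an unspecified order, which is why yellow comes first in the run above. If the output must follow the order of the words file, one option (a sketch, with the sample files recreated via here-docs) is to record that order in a second array:

```shell
# Sample files from the question.
cat > words <<'EOF'
blue
red
green
yellow
EOF
cat > text <<'EOF'
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
EOF

# order[] remembers each word's position in the words file,
# so END can print in file order instead of hash order.
awk 'NR==FNR{order[++c]=$0; n[$0]; l[$0]; next}
     {for(w in n){s=$0; t=gsub(w,"#",s); n[w]+=t; l[w]+=(t>0)}}
     END{for(i=1;i<=c;i++){x=order[i]; print x, n[x], l[x]}}' words text
```

This prints blue 3 2, red 0 0, green 1 1, yellow 0 0, in the order of the words file.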
If you want to format the output, you can simply pipe the awk output to column -t so it looks like this:
yellow 0 0
red 0 0
green 1 1
blue 3 2