Grep and give informative output
I want to see how many times each word from a list is mentioned in a file, and on how many lines.
My dummy examples look like this:
cat words
blue
red
green
yellow
cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
This is what I'm doing:
for i in $(cat words); do grep "$i" text | wc >> output; done
cat output
2 2 51
0 0 0
1 1 26
0 0 0
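For reference, `wc` with no flags prints three numbers per input — lines, words, and characters — which is why each row above has three columns but no word name. A minimal check of the first row:

```shell
# Two lines contain "blue"; each line is one "word" to wc,
# and the two lines total 51 characters including newlines.
printf 'TEXTTEXTblueTEXTTEXTblue\nTEXTTEXTgreenblueTEXTTEXT\n' | grep blue | wc
```

This reproduces the `2 2 51` row from the output above.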
But what I really want to get is:
1. The word that was used as the variable.
2. How many times the word was found, and on how many lines (each line counted once).
The preferred output looks like this:
blue 3 2
red 0 0
green 1 1
yellow 0 0
$1 - the word that was grep'ed
$2 - how many times the word was found in the text
$3 - the number of lines on which it was found
Hopefully someone can help me do this with grep, awk, or sed, as they are fast enough for a large dataset, but a Perl one-liner would also help.
Edit
Tried this
for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*
and it looks nice, but some words are longer than 300 characters, so I can't create a file named after each word.
You can use grep's -o option, which prints only the matched parts of matching lines, with each match on a separate output line.
while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo "$line" "$wordcount" "$linecount"
done < words | column -t
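A quick way to try it: recreate the sample files from the question with here-docs and run the loop (a sketch; it assumes grep supports -o and that column is available):

```shell
# Recreate the sample files from the question.
cat > words <<'EOF'
blue
red
green
yellow
EOF

cat > text <<'EOF'
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
EOF

# -o counts every match; -c counts each matching line once.
while IFS= read -r line; do
    wordcount=$(grep -o "$line" text | wc -l)
    linecount=$(grep -c "$line" text)
    echo "$line" "$wordcount" "$linecount"
done < words | column -t
```

This should print the rows blue 3 2, red 0 0, green 1 1, yellow 0 0, aligned by column -t.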
You can put everything on one line to make it a one-liner.
If column gives a "line too long" error, you can use printf instead, as long as you know the maximum number of characters. Use the printf below instead of echo, and remove the pipe to column:
printf "%-20s %-2s %-2s\n" "$line" $wordcount $linecount
Replace 20 with your maximum word length, and adjust the other numbers as needed.
Here's a similar Perl solution, but written as a complete script.
#!/usr/bin/perl
use 5.012;
die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless @ARGV;
my $wordsfile = shift @ARGV;
my @wordlist = do {
    open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
    map { chomp; length() ? $_ : () } <$words_fh>;
};
my %words;
while (<>) {
    for my $word (@wordlist) {
        my $cnt = 0;
        $cnt++ for /\Q$word\E/g;
        $words{$word}[0] += $cnt;
        $words{$word}[1] += 1 & !!$cnt;  # trick to force 1 or 0
    }
}
# Sorts output by frequency; remove `sort {...}` to get unsorted output.
for my $key (sort { $words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b } keys %words) {
    say join "\t", $key, @{ $words{$key} };
}
Output example:
blue 3 2
green 1 1
red 0 0
yellow 0 0
Advantage over the bash script: each file is read only once.
This gets pretty ugly as a Perl one-liner (partly because it needs to get data from two files and only one can be sent to stdin, partly because of the requirement to count both the number of lines matched and the total number of matches), but here you go:
perl -E 'undef $|; open $w, "<", "words"; @w=<$w>; chomp @w; $r{$_}=[0,{}] for @w; my $re = join "|", @w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for @w' < text
This requires perl 5.10 or newer, but changing it to support 5.8 and earlier is trivial: change -E to -e, change say to print, and add \n to the end of each line of output.
Output:
blue 3 2
red 0 0
green 1 1
yellow 0 0
An awk (gawk) one-liner can save you from the grep puzzle:
awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
Formatting the code a bit:
awk 'NR==FNR{n[$0];l[$0];next}
     {for(w in n){
         s=$0
         t=gsub(w,"#",s)
         n[w]+=t; l[w]+=(t>0?1:0)
     }}
     END{for(x in n)print x,n[x],l[x]}' words text
Checked with your example:
kent$ awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow 0 0
red 0 0
green 1 1
blue 3 2
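Note that `for (w in n)` iterates in an unspecified order, which is why yellow comes first in the run above. If the output must follow the order of the words file, one option (a sketch, with the sample files recreated via here-docs) is to record that order in a second array:

```shell
# Sample files from the question.
cat > words <<'EOF'
blue
red
green
yellow
EOF
cat > text <<'EOF'
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
EOF

# order[] remembers each word's position in the words file,
# so END can print in file order instead of hash order.
awk 'NR==FNR{order[++c]=$0; n[$0]; l[$0]; next}
     {for(w in n){s=$0; t=gsub(w,"#",s); n[w]+=t; l[w]+=(t>0)}}
     END{for(i=1;i<=c;i++){x=order[i]; print x, n[x], l[x]}}' words text
```

This prints blue 3 2, red 0 0, green 1 1, yellow 0 0, in the order of the words file.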
If you want to format the output, you can simply pipe the awk output to column -t so it looks like this:
yellow 0 0
red 0 0
green 1 1
blue 3 2