How to stop grep from being greedy in bash
I have an HTML page with the following content:
[...]
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
[...]
And I would like to extract only
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
to find the latest version (in this case it will be play-1.0.2.1.zip).
So I tried with
cat tmp.html | grep "<a href=\".*\""
<a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m"
<a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="m"
<a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="m"
So I tried a lazy quantifier:
cat tmp.html | grep "<a href=\".*?\""
and a negated character class for the quotes:
cat tmp.html | grep "<a href=\"[^\"]*?\""
Both of them return nothing.
I only need to get the relevant part (not the href attribute itself) and then find the latest one, but I am stuck on this greediness problem...
-
Thanks a lot for all the answers. They were very helpful, and it was hard to decide which one to accept. In the end I solved it with:
grep -v '.*-RC.*' index.html | grep -oP 'play-1.*?\.zip' | sort -Vru | head -1
Unlike other answers, this can be done entirely with grep.
Your output is slightly different from your input - additional elements appear. For the purposes of this answer, I'm going to use this file:
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
There are several things you need to do here. First, you need to set the correct grep switches. You need:
- -o only print the matched portion of each line
- -P use Perl-compatible regular expression engine
Now you can use the ? modifier to prevent greedy matching:
grep -o -P '<a href=".*?"' test.html
<a href="play-1.0.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.zip"
<a href="play-1.0.1.zip"
This is not quite right yet, so we anchor the regex to the start of the line to get only the first match on each line:
grep -o -P '^<tr><td class="n"><a href=".*?"' test.html
<tr><td class="n"><a href="play-1.0.1.zip"
<tr><td class="n"><a href="play-1.0.2.1.zip"
<tr><td class="n"><a href="play-1.0.2.zip"
This is the correct data, but with too much cruft. We need to use zero-width assertions (part of the PCRE syntax). Essentially, these are bits of the regex that must match but are not included in the matched text.
grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' test.html
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
Now you can do whatever you need to do to sort the list. More information on zero width assertions can be found here: http://www.regular-expressions.info/lookaround.html
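As one way to finish the job, the extracted names can be piped through a version-aware sort. This is only a minimal sketch, assuming GNU grep built with PCRE support and GNU sort (for -V). The temporary file and the variable names sample, names and latest are illustrative and not part of the answer above.

```shell
# Recreate the question's three table rows in a temporary file (illustrative data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
EOF

# The lookbehind and lookahead pin the match between href=" and the closing
# quote without becoming part of what -o prints.
names=$(grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' "$sample")

# sort -V compares the embedded version numbers; the last line is the newest.
latest=$(printf '%s\n' "$names" | sort -V | tail -n 1)

echo "$latest"
rm -f "$sample"
```

Using tail -n 1 after an ascending sort is equivalent to sort -Vr piped into head -n 1; pick whichever reads better in your script.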
With GNU tools, you can do
grep -oP '(?<=<td class="n"><a href=")[^"]+' tmp.html | sort -Vr | head -1
$ grep 'href=' tmp.html | sed 's/.*href="\([^"]*\)".*/\1/'
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
I haven't seen cut mentioned yet (and I like its brevity and speed), so here it is:
cut -d\" -f4 tmp.html | sort -Vu | tail -1
output:
play-1.0.2.1.zip
Try with the -E switch:
piotrekkr@piotrekkr-desktop:~$ echo '<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>' | grep -E '<a href=".*?"'
<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>
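For comparison, here is a small sketch of how the same pattern behaves in each regex flavor, assuming GNU grep with PCRE support. The sample line is modelled on the output shown in the question, and the variable names are illustrative.

```shell
# One line with extra quoted attributes after the link (modelled on the question's output).
line='<a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m"'

# Basic regex (no switch): ? is a literal character, so nothing matches.
# grep exits non-zero on no match, hence the "|| true".
bre_count=$(printf '%s\n' "$line" | grep -c '<a href=".*?"' || true)

# Extended regex (-E): POSIX matching is leftmost-longest, so .*? is still greedy.
ere_match=$(printf '%s\n' "$line" | grep -o -E '<a href=".*?"')

# PCRE (-P): .*? is lazy and stops at the first closing quote.
pcre_match=$(printf '%s\n' "$line" | grep -o -P '<a href=".*?"')

# Portable alternative: a negated character class needs no lazy quantifier.
class_match=$(printf '%s\n' "$line" | grep -o '<a href="[^"]*"')

printf 'BRE matches: %s\nERE: %s\nPCRE: %s\nclass: %s\n' \
  "$bre_count" "$ere_match" "$pcre_match" "$class_match"
```

Without -o, grep prints the whole matching line, which is why the output above contains the complete input even though the pattern itself ends at a quote.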
grep doesn't seem like the right tool for this, as you want to extract only part of the match.
Here's a perl one-liner that will do it, though:
$ perl -ne 'while(/<a href="([^"]+)"/g){print $1, "\n";}' input
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
Using the answer provided by Craig Andrews, with OS X support added:
grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' /test.html | sort -n -r -k1.10,12
Result:
play-1.0.2.1.zip
play-1.0.2.zip
play-1.0.1.zip
Awk is a great tool if you know the field numbers:
awk -F\" '$4 ~ /play.*zip/{ print $4 }' tmp.html
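If you are not sure which field number you need, a quick sketch like the following lists every field for one sample row. The row is copied from the question; the variable names are illustrative.

```shell
# One row from the question's HTML.
line='<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>'

# Print every field with its number when the separator is a double quote.
fields=$(printf '%s\n' "$line" | awk -F'"' '{ for (i = 1; i <= NF; i++) print i ": " $i }')
echo "$fields"

# Field 4 is the href value, so the filter from this answer selects it.
name=$(printf '%s\n' "$line" | awk -F'"' '$4 ~ /play.*zip/ { print $4 }')
echo "$name"
```

The same numbering applies to cut -d\" -f4, since both tools split on the double quote.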
Or, as a somewhat dirty alternative, search for all zip files:
cat file | tr '"' '\n' | grep -e '\.zip$' | sort -u
This will give you all the zip files. The tr utility is underused: it simply replaces characters, in this case turning each double quote into a newline, which neatly puts each quoted value on its own line where grep can find it. The -u option of sort removes duplicates.
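To make the two steps visible, here is a sketch that keeps the intermediate result in a variable. The sample rows (including the "mirror" link that repeats a file name) are made up for illustration, as are the variable names.

```shell
# Two rows, one of which links to play-1.0.1.zip a second time (illustrative data).
html='<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td><a href="play-1.0.1.zip">mirror</a></td></tr>'

# Step 1: every double quote becomes a newline, so each quoted value sits on its own line.
split=$(printf '%s\n' "$html" | tr '"' '\n')

# Step 2: keep only lines ending in a literal ".zip", then let sort -u drop the repeat.
zips=$(printf '%s\n' "$split" | grep -e '\.zip$' | sort -u)
echo "$zips"
```

Link text such as >play-1.0.1.zip</a> does not end in .zip once the line is split, so only the attribute values survive the grep.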
The perl way:
cat thefile | perl -anF'"' -e 'print $F[3],"\n";($v)=$F[3]=~/(\d.*\d)/;$m=$v if$v gt $m;}{print "max=$m\n";'
output:
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
max=1.0.2.1