How to stop greed with grep from bash

Question

I have an HTML page with the following content:

[...]
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
[...]

And I would like to extract only

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

to find the latest version (in this case it will be play-1.0.2.1.zip).

So I tried with

cat tmp.html | grep "<a href=\".*\""

<a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m"
<a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="m"
<a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="m"

So, I tried with lazy:

cat tmp.html | grep "<a href=\".*?\""

and negation of quotes

cat tmp.html | grep "<a href=\"[^\"]*?\""

both of them return nothing

I only need to get the relevant part (not the href) and then find the last one, but I am stuck with this greed problem ...

-

Thanks a lot for all the answers. They were very helpful, and it was hard to decide which one is correct. In the end I solved it with:

grep -v '.*-RC.*' index.html | grep -oP 'play-1.*?.zip' | sort -Vru | head -1
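
A minimal sketch of that pipeline run end to end, assuming GNU grep with PCRE support (-P) and GNU sort (-V). The sample file and the release-candidate name play-1.1-RC1.zip are made up for illustration, and the dot before zip is escaped here so it matches only a literal dot:

```shell
#!/usr/bin/env bash
# Hypothetical index.html: the three releases from the question plus one
# made-up release candidate that should be ignored.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
<tr><td class="n"><a href="play-1.1-RC1.zip">play-1.1-RC1.zip</a></td></tr>
EOF

# 1. Drop lines that mention a release candidate.
# 2. Print only the lazily matched file names (\. matches a literal dot).
# 3. Version-sort in reverse, dropping duplicates, and keep the first line.
latest=$(grep -v -e '-RC' "$tmp" | grep -oP 'play-1.*?\.zip' | sort -Vru | head -n 1)
echo "$latest"
rm -f "$tmp"
```

Each row yields the name twice (once from the href, once from the link text), which is why the -u flag on sort is useful here.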

+3

bash regex grep

opensas 15 Mar 12 at 13:01

9 replies

With GNU tools, you can do

grep -oP '(?<=<td class="n"><a href=")[^"]+' | sort -Vr | head -1
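
As a hedged sketch, the command above can be tried against a small sample file. The temporary file and its contents are assumptions that mirror the rows in the question, and GNU grep with -P plus GNU sort with -V are assumed to be available:

```shell
#!/usr/bin/env bash
# Hypothetical sample file mirroring the rows from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
EOF

# The lookbehind pins the match to just after href=" and [^"]+ stops at the
# closing quote, so no lazy quantifier is needed.
latest=$(grep -oP '(?<=<td class="n"><a href=")[^"]+' "$tmp" | sort -Vr | head -n 1)
echo "$latest"
rm -f "$tmp"
```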

+5

glenn jackman 15 Mar 12 at 13:49

$ grep 'href=' tmp.html | sed 's/.*href="\(.*\)".*/\1/'
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

+3

strkol 15 Mar 12 at 13:16

I haven't seen cut suggested yet (I like its brevity and speed), so here it is:

cut -d\" -f4 tmp.html | sort -Vu | tail -1

output:

play-1.0.2.1.zip

+3

jokmi 27 Nov 12 at 8:46

Try with the -E switch:

piotrekkr@piotrekkr-desktop:~$ echo '<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>' | grep -E '<a href=".*?"'
<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>
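
Note that the whole line is printed here because grep prints matching lines by default, and POSIX extended regular expressions (what -E selects) do not define lazy quantifiers. A hedged sketch of an alternative that needs no laziness: a negated character class combined with -o. The sample line, with two links on one line, is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical input: two links on one line, so a greedy .* would overshoot.
line='<a href="play-1.0.1.zip">a</a><a href="play-1.0.2.zip">b</a>'

# [^"]* cannot cross a double quote, so each match ends at the first closing
# quote even though the quantifier itself is greedy.
matches=$(printf '%s\n' "$line" | grep -oE '<a href="[^"]*"')
echo "$matches"
```

This prints one line per link, each ending at the quote that closes the href value.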

+2

piotrekkr 15 Mar 12 at 13:10

grep doesn't seem like the right tool for this, as you want to extract a captured group rather than just find matching lines.

Here's a perl one-liner that will do it, though:

$ perl -ne 'while(/<a href="([^"]+)"/g){print $1, "\n";}' input 
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

+1

Mat 15 Mar 12 at 13:15

Using the answer provided by Craig Andrews with OSX support added.

grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' /test.html | sort -n -r -k1.10,12

Result:

play-1.0.2.1.zip
play-1.0.2.zip
play-1.0.1.zip

+1

E1Suave Apr 10 12 at 15:31

Awk is a great tool if you know the field numbers:

awk -F\" '$4 ~ /play.*zip/{ print $4 }'

Or, as a somewhat dirty alternative, search for all zip files:

cat file | tr '"' '\n' | grep -e '\.zip$' | sort -u

This will give you all the zip files. The tr utility is underused; it simply replaces characters, in this case turning each double quote into a newline, which neatly puts the quoted data on its own line where you can grep for it. The -u option to sort removes duplicates.
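
A minimal sketch of the awk approach combined with a version sort to pick the latest file. The sample file is an assumption mirroring the question, the pattern is tightened slightly to require a literal .zip suffix, and GNU sort is assumed for -V:

```shell
#!/usr/bin/env bash
# Hypothetical sample file mirroring the rows from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[...]
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
[...]
EOF

# With " as the field separator, the href value is the fourth field.
# Lines without quotes have an empty $4 and are skipped by the pattern.
latest=$(awk -F'"' '$4 ~ /^play.*\.zip$/ { print $4 }' "$tmp" | sort -V | tail -n 1)
echo "$latest"
rm -f "$tmp"
```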

0

Ian Roddis 15 Mar 12 at 13:52

The perl way:

cat thefile | perl -anF'"' -e 'print $F[3],"\n";($v)=$F[3]=~/(\d.*\d)/;$m=$v if$v gt $m;}{print "max=$m\n";'

output:

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
max=1.0.2.1

0

Toto 15 Mar 12 at 14:16

Craig Andrews · Accepted Answer · 2012-03-15T13:32:44+0000

Unlike other answers, this can be done entirely with grep.

Your output is slightly different from your input - additional elements appear. For the purposes of this answer, I'm going to use this file:

<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>

There are several things you need to do here. First, you need to set the correct grep switches. You need:

-o only print the matched portion of each line
-P use Perl-compatible regular expression engine

Now you can use the ? modifier to prevent greedy matching:

grep -o -P '<a href=".*?"' test.html

<a href="play-1.0.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.zip"
<a href="play-1.0.1.zip"

This is not quite right, because it also picks up the second link on each line, so we anchor the regex to the start of the line:

grep -o -P '^<tr><td class="n"><a href=".*?"' test.html

<tr><td class="n"><a href="play-1.0.1.zip"
<tr><td class="n"><a href="play-1.0.2.1.zip"
<tr><td class="n"><a href="play-1.0.2.zip"

This is the correct data, but with too much cruft around it. We need to use zero-width assertions (part of the PCRE syntax). Essentially, these are parts of the regex that must match but are not included in the matched text.

grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' test.html

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

Now you can do whatever you need to do to sort the list. More information on zero width assertions can be found here: http://www.regular-expressions.info/lookaround.html
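
One possible way to do the sorting step, as a sketch: pipe the extracted names through a version sort and take the last line. This assumes GNU sort (for -V) and recreates the test file used above in a temporary location:

```shell
#!/usr/bin/env bash
# Recreate the test file from this answer (two links per row).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
EOF

# Extract the first href of each row, then sort by version number.
versions=$(grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' "$tmp" | sort -V)

# The highest version ends up on the last line.
latest=$(printf '%s\n' "$versions" | tail -n 1)
echo "$latest"
rm -f "$tmp"
```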
