
How to stop greedy matching with grep in bash

I have an HTML page with the following content:

[...]
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
[...]

      

And I would like to extract only

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

      

so that I can find the latest version (in this case it will be play-1.0.2.1.zip).

So I tried with

cat tmp.html | grep "<a href=\".*\""

      

<a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m"
<a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="m"
<a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="m"

      

So I tried a lazy quantifier:

cat tmp.html | grep "<a href=\".*?\""

      

and a negated character class for the quotes:

cat tmp.html | grep "<a href=\"[^\"]*?\""

      

Both of them return nothing.

I only need to get the relevant part (not the href) and then find the latest one, but I am stuck on this greediness problem.

-

Thanks a lot for all the answers. They were very helpful, and it was hard to decide which one is correct. In the end I solved it with:

grep -v '.*-RC.*' index.html | grep -oP 'play-1.*?\.zip' | sort -Vru | head -1

      

+3
bash regex grep




9 replies


Unlike other answers, this can be done entirely with grep.

Your output differs slightly from your input: additional elements appear. For the purposes of this answer, I'm going to use this file:

<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>

      

There are several things you need to do here. First, you need to set the correct grep switches. You need:

  • -o only print the matched portion of each line
  • -P use Perl-compatible regular expression engine

Now you can use the ? modifier to prevent greedy matching:



grep -o -P '<a href=".*?"' test.html

<a href="play-1.0.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.zip"
<a href="play-1.0.1.zip"

      

This is not quite right, because it also matches the second link on each line, so we anchor the regex to the start of the line to keep only the first match:

grep -o -P '^<tr><td class="n"><a href=".*?"' test.html

<tr><td class="n"><a href="play-1.0.1.zip"
<tr><td class="n"><a href="play-1.0.2.1.zip"
<tr><td class="n"><a href="play-1.0.2.zip"

      

This is the correct data, but with too much cruft around it. We need to use zero-width assertions (part of the PCRE syntax). Essentially, these are parts of the regex that must match but are left out of the matched text.

grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' test.html

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

      

Now you can do whatever you need to do to sort the list. More information on zero width assertions can be found here: http://www.regular-expressions.info/lookaround.html
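
As a shorter variant of the lookbehind above, PCRE also offers \K, which drops everything matched so far from the reported match. The sketch below assumes GNU grep built with -P support; the temporary file is only a stand-in for test.html.

```shell
# Stand-in for test.html, using the same rows as in this answer.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
EOF

# \K discards the prefix matched so far, so only the href value is printed.
# The ^ anchor keeps just the first link on each line.
# [^"]+ stops at the closing quote, so no lazy quantifier is needed.
names=$(grep -oP '^<tr><td class="n"><a href="\K[^"]+' "$tmp")
printf '%s\n' "$names"

rm -f "$tmp"
```

The negated class replaces both the lazy .*? and the trailing lookahead, which keeps the pattern a bit easier to read.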

+6




With GNU tools, you can do



grep -oP '(?<=<td class="n"><a href=")[^"]+' tmp.html | sort -Vr | head -1
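
To see why -V matters here, a small sketch comparing plain sort with version sort. The version strings are made up for illustration and are not from the page in the question; sort -V is a GNU coreutils option.

```shell
# Hypothetical version strings, chosen so lexical and version order differ.
versions=$(printf '%s\n' 1.0.9 1.0.10 1.0.2)

# Plain sort compares character by character, so "1.0.10" lands before "1.0.2".
lexical_last=$(printf '%s\n' "$versions" | LC_ALL=C sort | tail -n 1)

# sort -V compares each numeric component as a number.
version_last=$(printf '%s\n' "$versions" | sort -V | tail -n 1)

echo "plain sort picks: $lexical_last"
echo "sort -V picks:    $version_last"
```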

      

+5




$ grep 'href=' tmp.html | sed 's/.*href="\([^"]*\)".*/\1/'
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
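
One thing to watch with sed: a capture group written as \(.*\) is greedy and runs to the last quote on the line, which matters when more quoted attributes follow the link (as in the question's real output). A rough sketch, where the sample line is made up to include such a trailing attribute:

```shell
# Hypothetical line with another quoted attribute after the link.
line='<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m">x</td></tr>'

# Greedy \(.*\) captures up to the last double quote on the line.
greedy=$(printf '%s\n' "$line" | sed 's/.*href="\(.*\)".*/\1/')

# [^"]* cannot cross a quote, so the capture ends with the href value.
# -n plus the p flag prints only lines where the substitution succeeded,
# which makes the separate grep 'href=' step unnecessary.
bounded=$(printf '%s\n' "$line" | sed -n 's/.*href="\([^"]*\)".*/\1/p')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```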

      

+3




I haven't seen cut mentioned yet (and I like its brevity and speed), so here it is:

cut -d\" -f4 tmp.html | sort -Vu | tail -1

output:

play-1.0.2.1.zip

+3




Try with the -E switch:

piotrekkr@piotrekkr-desktop:~$ echo '<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>' | grep -E '<a href=".*?"'
<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>
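
Note that the output above is the whole line, because grep prints matching lines unless -o is given. As a rough sketch (assuming GNU grep), adding -o shows what actually matched: POSIX extended regexes have no lazy quantifiers, so the ? does not shorten the match, whereas a negated character class does the job without -P. The sample line is made up to include a second quoted attribute.

```shell
# Hypothetical line with a quoted attribute after the link.
line='<a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m">'

# With -E the ? does not make .* lazy; the match still runs to the last quote.
greedy=$(printf '%s\n' "$line" | grep -oE '<a href=".*?"')

# A negated character class stops at the first closing quote.
bounded=$(printf '%s\n' "$line" | grep -o '<a href="[^"]*"')

echo "with -E and .*?: $greedy"
echo "with [^\"]*:      $bounded"
```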

      

+2




grep doesn't seem like the right tool for this, as you want to extract a substring rather than print whole lines.

Here's a perl one-liner that will do it, though:

$ perl -ne 'while(/<a href="([^"]+)"/g){print $1, "\n";}' input 
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

      

+1




Using the answer provided by Craig Andrews with OSX support added.

grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' /test.html | sort -n -r -k1.10,12

      

Result:

play-1.0.2.1.zip
play-1.0.2.zip
play-1.0.1.zip

      

+1




Awk is a great tool if you know the field numbers:

awk -F\" '$4 ~ /play.*zip/{ print $4 }'

      

Or, as a somewhat dirty alternative, search for all zip files:

cat file | tr '"' '\n' | grep -e '\.zip$' | sort -u

      

This will give you all the zip files. The tr utility is often overlooked. It simply translates characters, in this case replacing each double quote with a newline, which neatly puts each quoted value on its own line where you can grep it. The -u option of sort removes duplicates.
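
To make the tr step concrete, here is a small sketch on a single row from the question, showing the intermediate lines before grep filters them:

```shell
line='<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>'

# Each double quote becomes a newline, so quoted values sit on their own lines.
pieces=$(printf '%s\n' "$line" | tr '"' '\n')
printf '%s\n' "$pieces"

# Keep only the lines that end in a literal ".zip", without duplicates.
zips=$(printf '%s\n' "$pieces" | grep -e '\.zip$' | sort -u)
echo "zip files: $zips"
```

The link text does not survive the filter because its line ends in </a></td></tr> rather than .zip.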

0




The perl way:

cat thefile | perl -anF'"' -e 'print $F[3],"\n";($v)=$F[3]=~/(\d.*\d)/;$m=$v if$v gt $m;}{print "max=$m\n";'

      

output:

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
max=1.0.2.1

      

0








