Python Regex Match Line If Ends?

Question

This is what I am trying to clean up:

        <p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />

I've tried several options:

match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)

I am looking to match the strings both with and without the "p" tag. The "p" tag appears only in the first instance. I'm pretty rusty with Python, and I searched here and on Google but nothing seemed to cover the same case. Thanks for any help. I really appreciate the help I get here when I get stuck.

The required output is the index:

<a href="Some.Title.html">http://www.SomeLink.com/yep.html</a>
<a href="Some.Title.txt">http://www.SomeLink.com/yeppers.txt</a>

python regex

Bobby peters 31 jul. 17 at 2:42

1 answer

Matthew barlowe · Answer 1 · 2017-07-31T03:20:30+0000

As the commenters pointed out above, Beautiful Soup together with the requests module is a better fit for something like this than a regex.

import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped; requests needs the http:// or https:// scheme
site_data = requests.get(html_site)  # downloads the page and returns a Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # parses the page text into a bs4 object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns a list of them

This is just a simple snippet that selects all 'a' tags from the HTML of the site and saves them in a list. I would suggest working through a good bs4 tutorial and reading the official bs4 docs.
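To get from the list of 'a' tags to the title/link pairs the question asks for, something like the following might work. It is only a sketch run against the sample markup from the question rather than a live site. The helper name title_link_pairs and the SAMPLE_HTML constant are made up for illustration, and it assumes each title is the nearest non-blank text before its link, as in the sample.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample markup copied from the question; note the <p> tag is never closed there.
SAMPLE_HTML = """<p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />
"""


def title_link_pairs(html):
    """Hypothetical helper: return a (title, url) tuple for every 'a' tag with an href."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a_tag in soup.find_all("a", href=True):
        title = None
        # Walk backwards over the siblings before the link and take the first
        # non-blank text node, skipping <br> tags and bare newlines.
        for node in a_tag.previous_siblings:
            if isinstance(node, NavigableString) and node.strip():
                title = node.strip()
                break
        pairs.append((title, a_tag["href"]))
    return pairs


pairs = title_link_pairs(SAMPLE_HTML)
for title, url in pairs:
    print(title, url)
```

Because the parser builds a tree, it does not matter whether the "p" tag is present or not, which is the case the regex was struggling with. The pairs are kept as plain tuples so they can be formatted into whatever output markup is needed.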
