Python Regex Match Line If Ends?

This is what I am trying to clean up:

        <p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />

      

I've tried several options:

match = re.compile('^(.+?)<br \/><a href="https://www.somelink.com(.+?)">',re.DOTALL).findall(html)

      

I want to match these strings both with and without the "p" tag; the "p" tag appears only in the first instance. I'm pretty rusty with Python. I searched here and on Google, and nothing seemed to cover the same case. Thanks for any help. I really appreciate the help I get here when I get stuck.

The required output is the index:

<a href="Some.Title.html">http://www.SomeLink.com/yep.html</a>
<a href="Some.Title.txt">http://www.SomeLink.com/yeppers.txt</a>

      



1 answer


Beautiful Soup together with the requests module is a much better fit for something like this than a regex, as commenters pointed out.

import requests
import bs4

html_site = 'https://www.google.com'  # or whatever site you need scraped; requests needs the scheme
site_data = requests.get(html_site)  # downloads the page into a Response object
site_parsed = bs4.BeautifulSoup(site_data.text, 'html.parser')  # parses the page text into a bs4 object
a_tags = site_parsed.select('a')  # selects all 'a' tags and returns them as a list

      



This is just simple code that selects all 'a' tags from the HTML page and stores them in a list. I would suggest working through a good bs4 tutorial and the actual docs.
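If installing third-party packages is not an option, the same "parse it, don't regex it" idea can be sketched with the standard library's html.parser instead of bs4. This is only a rough sketch, not the bs4 approach itself: the names TitleLinkParser and build_index are made up for illustration, and it assumes each title is the last non-blank piece of text before its link, as in the sample from the question. It keeps the question's output layout (title in href, URL as link text) and leaves the URLs exactly as they appear in the input.

```python
from html.parser import HTMLParser

# Sample input copied from the question.
HTML = """<p>Some.Title.html<br />
<a href="https://www.somelink.com/yep.html" rel="nofollow">https://www.somelink.com/yep.html</a><br />
Some.Title.txt<br />
<a href="https://www.somelink.com/yeppers.txt" rel="nofollow">https://www.somelink.com/yeppers.txt</a><br />
"""


class TitleLinkParser(HTMLParser):
    """Hypothetical helper: pair each <a href> with the last non-blank text before it."""

    def __init__(self):
        super().__init__()
        self.pairs = []         # collected (title, url) tuples
        self._last_text = None  # most recent non-blank text seen outside a link
        self._in_link = False   # True while we are between <a> and </a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            href = dict(attrs).get("href")
            if href and self._last_text:
                self.pairs.append((self._last_text, href))
                self._last_text = None  # each title is used for one link only

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        text = data.strip()
        # Ignore the link's own text and whitespace-only chunks between tags.
        if text and not self._in_link:
            self._last_text = text


def build_index(html):
    """Return one '<a href="title">url</a>' line per title/link pair."""
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return [f'<a href="{title}">{url}</a>' for title, url in parser.pairs]


index_lines = build_index(HTML)
print("\n".join(index_lines))
```

Because the parser only looks at text and 'a' tags, it does not care whether a title sits inside the "p" tag or not, which is the part the regex was struggling with.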
