Beautiful Soup - find the first link in an article
I am writing a Python solution for this problem, but I am having trouble with some edge cases.
The problem occurs for a page like this, where this link is the one to retrieve because it is the first one outside the parentheses. Conversely, there are articles like this, where the link appears before the first parenthesis.
I currently handle these cases by iterating over the elements and text in the first paragraph tag (gated version) and checking which comes first, a '(' or an <a>. If the <a> comes first (meaning it appears before any parenthesis), I just grab that link. If the parenthesis comes first, I wait until the parentheses are closed and then grab the next link.
Essentially, I only need a link that is a direct child of the first paragraph, which could be selected with something like:
soup = BeautifulSoup(response.content, "lxml")
soup.select_one("#mw-content-text > p > a")
I think a select statement like this would work here: find the first link in the prefix that runs from the beginning of the <p> to the first parenthesis, or (if there is no link in the prefix) find the link immediately after the closing parenthesis, using something similar to what I am doing currently:
`findNext('a').attrs['href']`
If this approach is used, several problems arise, including: 1. How to actually get the prefix before the first parenthesis using only direct children.
Is there an optimized way to do this? If there is a better approach, what would it be?
This reminds me of the popular algorithms and data structures problems where you need to check whether parentheses or other brackets are balanced. A stack is a convenient data structure for problems like this.
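As a minimal illustration of that classic problem (separate from the link-finding code itself), the sketch below checks whether the brackets in a string are balanced, using a list as a stack. The names `is_balanced` and `PAIRS` are hypothetical, chosen only for this example.

```python
# Hypothetical helper for illustration: classic balanced-brackets check.
# Maps each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}


def is_balanced(text):
    """Return True if every bracket in `text` is closed in the right order."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            # Opening bracket: remember it until its partner shows up.
            stack.append(ch)
        elif ch in PAIRS:
            # Closing bracket: it must match the most recent opening one.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Anything left on the stack was never closed.
    return not stack


print(is_balanced("a (b [c] d) e"))
print(is_balanced("(]"))
```

For the link problem only round parentheses matter, so the stack never needs to compare bracket types; it only needs to know whether anything is still open.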
So, in this case, we push onto the stack when we see an opening parenthesis and pop from it when we see a closing one. A link is valid for us when the stack is empty:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

urls = [
    "https://en.wikipedia.org/wiki/Modern_Greek",
    "https://en.wikipedia.org/wiki/Diglossia"
]

with requests.Session() as session:
    for url in urls:
        response = session.get(url)
        soup = BeautifulSoup(response.content, "html.parser")

        stack = []  # holds one "(" per currently open parenthesis
        for child in soup.select_one("#mw-content-text > p").children:
            if isinstance(child, NavigableString):
                # Count every parenthesis in the text node, not just the first.
                for char in child:
                    if char == "(":
                        stack.append("(")
                    elif char == ")" and stack:  # ignore an unmatched ")"
                        stack.pop()

            # A direct-child link counts only when no parenthesis is open.
            if isinstance(child, Tag) and child.name == "a" and not stack:
                print(child.get_text())
                break
It prints dialects for the Modern Greek page and linguistics for Diglossia. Both cases are handled.
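To try the idea without network access, here is a hedged sketch that wraps the stack logic in a function and runs it on an inline HTML snippet. The function name `first_link_outside_parens` and the sample markup are assumptions made for this example, not taken from Wikipedia. It counts each parenthesis character individually and ignores an unmatched `)`.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def first_link_outside_parens(paragraph):
    """Return the first direct-child <a> of `paragraph` not inside (...), or None."""
    stack = []
    for child in paragraph.children:
        if isinstance(child, NavigableString):
            # Track nesting depth character by character.
            for ch in child:
                if ch == "(":
                    stack.append(ch)
                elif ch == ")" and stack:
                    stack.pop()
        elif isinstance(child, Tag) and child.name == "a" and not stack:
            # No parenthesis is open, so this link qualifies.
            return child
    return None


# Made-up markup: the first link sits inside parentheses and must be skipped.
html = '<p>Foo (from <a href="/wiki/Bar">Bar</a>) is a <a href="/wiki/Baz">baz</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
link = first_link_outside_parens(soup.p)
print(link["href"])
```

Only direct children of the paragraph are scanned, matching the selector in the question, so links or parentheses nested inside other tags such as `<b>` or `<i>` are not considered.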