BeautifulSoup Parse text after <b> and before </br>
I have this code trying to parse search results from a grant website (please find the url in the code, I cannot post the link until my rep is higher), "Year" and "Award Amount" after tags and before tags.
Two questions:
1) Why does this only return the 1st table?
2) Anyway, I can get the text that after the lines (e.g. year and premium amount) and (i.e. actual number e.g. 2015 and $ 100,000)
In particular:
Here's my script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
'organizationName=®ion=ASIA&projectCountry=China&amount=&fromDate=&toDate=&' \
'projectFocus%5B%5D=&search=&maxCount=25&orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, "html.parser")
tables = soup.find_all('table')
data = {
'col_names': [],
'info' : [],
'year_amount':[]
}
index = 0
for table in tables:
rows = table.find_all('tr')[1:]
for row in rows:
cols = row.find_all('td')
data['col_names'].append(cols[0].get_text())
data['info'].append(cols[1].get_text())
try:
data['year_amount'].append(cols[2].get_text())
except IndexError:
data['year_amount'].append(None)
grant_df = pd.DataFrame(data)
index += 1
filename = 'grant ' + str(index) + '.csv'
grant_df.to_csv(filename)
+3
source to share
1 answer
I would suggest approaching parsing the table in a different way. All information is available in the first row of each table. Thus, you can parse the text of the string like this:
Code:
text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
[x.split(':', 1) for x in text.split('\n')]}
How?
This takes text and
- splits it into newlines
- removes blank lines
- removes any leading / trailing space
- concatenates lines together into one text
- concatenates any line ending on the
:
next line
Then:
- split the text again with a new line
- split each line into
:
- separate any spaces of the text outlines on either side of
:
- use separator text as key and value for
dict
Security Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
'organizationName=®ion=ASIA&projectCountry=China&amount=&' \
'fromDate=&toDate=&projectFocus%5B%5D=&search=&maxCount=25&' \
'orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
data = []
for table in soup.find_all('table'):
rows = table.find_all('tr')
text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
[x.split(':', 1) for x in text.split('\n')]}
if data_dict.get('Award Amount'):
data.append(data_dict)
grant_df = pd.DataFrame(data)
print(grant_df.head())
Results:
Award Amount Description \
0 $84,907 To strengthen the capacity of China rights d...
1 $204,973 To provide an effective forum for free express...
2 $48,000 To promote religious freedom in China. The org...
3 $89,000 To educate and train civil society activists o...
4 $65,000 To encourage greater public discussion, transp...
Organization Name Project Country Project Focus \
0 NaN Mainland China Rule of Law
1 Princeton China Initiative Mainland China Freedom of Information
2 NaN Mainland China Rule of Law
3 NaN Mainland China Democratic Ideas and Values
4 NaN Mainland China Rule of Law
Project Region Project Title Year
0 Asia Empowering the Chinese Legal Community 2014
1 Asia Supporting Free Expression and Open Debate for... 2014
2 Asia Religious Freedom, Rights Defense and Rule of ... 2014
3 Asia Education on Civil Society and Democratization 2014
4 Asia Promoting Democratic Policy Change in China 2014
+1
source to share