Scraping a JavaScript-driven webpage with BeautifulSoup
Hi guys! I'm turning to you again. I'm fine with scraping simple tagged websites, but I've recently come across a fairly complex website that uses JavaScript. I would like to get all the scores at the bottom of the page in table format (CSV), with columns like User, Income Score, EPS Score.
I was hoping to figure it out myself, but it didn't work out.
Here is my code:
from urllib.request import urlopen  # Python 3 (on Python 2: from urllib import urlopen)
from bs4 import BeautifulSoup
html = urlopen("https://www.estimize.com/jpm/fq3-2016?sort=rank&direction=asc&estimates_per_page=142&show_confirm=false")
soup = BeautifulSoup(html.read(), "html.parser")
print(soup.findAll('script')[11].string.encode('utf8'))
The output is in a weird format and I don't know how to extract the data in the appropriate form. Any help would be appreciated!
It looks like the data you are trying to fetch is in the page's data model, which means it is JSON. You can get at it with a little parsing, like this:
import json
import re

# assumes `soup` from the question's code; script index 11 is specific to this page
data_string = soup.findAll('script')[11].string
data_string = data_string.split("DataModel.parse(")[1]
data_string = data_string.split(");")[0]

# strip out stray HTML tags
data_string = re.sub(r'<[^>]*>', '', data_string)

# cut off the other function parameters, leaving you with the JSON
data_you_want = json.loads(data_string.split(re.search(r'\}[^",\}\]]+,', data_string).group(0))[0] + '}')
print(data_you_want["estimate"])
>>> {'shares': {'shares_hash': {'twitter': None, 'stocktwits': None, 'linkedin': None}}, 'lastRevised': None, 'id': None, 'revenue_points': None, 'sector': 'financials', 'persisted': False, 'points': None, 'instrumentSlug': 'jpm', 'wallstreetRevenue': 23972, 'revenue': 23972, 'createdAt': None, 'username': None, 'isBlind': False, 'releaseSlug': 'fq3-2016', 'statement': '', 'errorRanges': {'revenue': {'low': 21247.3532016398, 'high': 26820.423240734}, 'eps': {'low': 1.02460526459765, 'high': 1.81359679579922}}, 'eps_points': None, 'rank': None, 'instrumentId': 981, 'eps': 1.4, 'season': '2016-fall', 'releaseId': 52773}
DataModel.parse is a JavaScript method call, which means it ends with a closing parenthesis and a semicolon. The first parameter of the function is the JSON object you want. Once you load it with json.loads, you can access it like a dictionary.
From there, you can reshape the data into whatever form you want for the CSV.
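As an alternative to the regex cleanup, here is a minimal sketch that lets the json module find the end of the first argument by itself, using json.JSONDecoder().raw_decode. It is a different technique from the one above, not a drop-in copy. The helper name extract_first_json_argument and the sample_script string are made up for illustration, and the sketch assumes the first argument to DataModel.parse is valid JSON.

```python
import json


def extract_first_json_argument(script_text, call_name="DataModel.parse("):
    """Return the first JSON value passed to `call_name` inside script_text.

    Hypothetical helper: assumes the first argument is valid JSON.
    """
    start = script_text.index(call_name) + len(call_name)
    # raw_decode parses exactly one JSON value and returns where it stopped,
    # so any trailing arguments and the closing ");" are ignored.
    value, _end = json.JSONDecoder().raw_decode(script_text[start:].lstrip())
    return value


# Made-up, heavily simplified stand-in for the page's inline script.
sample_script = (
    'DataModel.parse({"estimate": {"eps": 1.4, "instrumentSlug": "jpm"}}, "extra");'
)

model = extract_first_json_argument(sample_script)
print(model["estimate"]["eps"])
```

On the real page you would pass in the text of the script tag instead of sample_script. Whether the embedded data parses cleanly without stripping the HTML first is something you would have to check.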
This is how I solved the problem, using a few of the tips above:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3
import json
import csv

url = "https://www.estimize.com/jpm/fq3-2016?sort=rank&direction=asc&estimates_per_page=142&show_confirm=false"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")
page_text = str(soup)

# cut the JSON array out from between the two neighbouring keys
data_string = page_text.split("\"allEstimateRows\":")[1]
data_string = data_string.split(",\"tableSortDirection")[0]
data = json.loads(data_string)

# write a header row, then one CSV row per estimate
with open("estimize.csv", "a", newline="") as out_file:
    f = csv.writer(out_file)
    f.writerow(["User Name", "Revenue Estimate", "EPS Estimate"])
    for item in data:
        f.writerow([item["userName"], item["revenue"], item["eps"]])
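If each row carries more keys than you want in the file, csv.DictWriter can pick out the columns for you. This is only a sketch: the rows below are invented sample data shaped like the items in "allEstimateRows" (the keys userName, revenue and eps come from the code above, the extra rank key and all values are assumptions), and rows_to_csv is a hypothetical helper name.

```python
import csv
import io

# Invented sample rows; real rows would come from json.loads(data_string).
rows = [
    {"userName": "analyst_a", "revenue": 24000.0, "eps": 1.41, "rank": 1},
    {"userName": "analyst_b", "revenue": 23850.5, "eps": 1.38, "rank": 2},
]


def rows_to_csv(rows, fields=("userName", "revenue", "eps")):
    """Render only the chosen fields of each row as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        extrasaction="ignore",  # silently drop keys that are not in `fields`
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of side effects. To write a real file, pass a file object opened with newline="" to csv.DictWriter instead.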