Scraping a JavaScript-driven webpage with BeautifulSoup

Question

Hi everyone! I'm turning to you again. I'm comfortable scraping simple websites built from plain HTML tags, but I recently came across a fairly complex site that relies on JavaScript. I would like to get all the scores at the bottom of the page in table format (CSV), with columns such as User, Income Score, and EPS Score.

I was hoping to figure it out myself, but it didn't work out.

Here is my code:

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.estimize.com/jpm/fq3-2016?sort=rank&direction=asc&estimates_per_page=142&show_confirm=false")
soup = BeautifulSoup(html.read(), "html.parser")
print(soup.findAll('script')[11].string)

The output is in a weird format and I don't know how to extract the data in the appropriate form. Any help would be appreciated!

javascript python csv web-scraping beautifulsoup

Anna Ignashkina Mar 30 17 at 14:13

2 answers

B.Adler · Answer 1 · 2017-03-30T23:00:10+0000

It looks like the data you are trying to fetch is passed to a data model, which means it is JSON. You can pull it out with a little string parsing, like the following:

import json
import re

data_string = soup.findAll('script')[11].string
data_string = data_string.split("DataModel.parse(")[1]
data_string = data_string.split(");")[0]

# strip out stray HTML tags
while re.search(r'<[^>]*>', data_string):
    data_string = ''.join(data_string.split(re.search(r'<[^>]*>', data_string).group(0)))

# cut off the other function parameters, leaving only the JSON
data_you_want = json.loads(data_string.split(re.search(r'\}[^",\}\]]+,', data_string).group(0))[0]+'}')

print(data_you_want["estimate"])
>>> {'shares': {'shares_hash': {'twitter': None, 'stocktwits': None, 'linkedin': None}}, 'lastRevised': None, 'id': None, 'revenue_points': None, 'sector': 'financials', 'persisted': False, 'points': None, 'instrumentSlug': 'jpm', 'wallstreetRevenue': 23972, 'revenue': 23972, 'createdAt': None, 'username': None, 'isBlind': False, 'releaseSlug': 'fq3-2016', 'statement': '', 'errorRanges': {'revenue': {'low': 21247.3532016398, 'high': 26820.423240734}, 'eps': {'low': 1.02460526459765, 'high': 1.81359679579922}}, 'eps_points': None, 'rank': None, 'instrumentId': 981, 'eps': 1.4, 'season': '2016-fall', 'releaseId': 52773}

DataModel.parse is a JavaScript method call, so it ends with a closing parenthesis and a semicolon. The first argument to the function is the JSON object you want. Once you load it with json.loads, you can access it like a dictionary.
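
As a rough sketch of the same idea, assuming the script text contains a call of the form DataModel.parse({...}, ...) whose first argument is strict JSON: the standard library's json.JSONDecoder.raw_decode parses one JSON value starting at a given index and stops where that value ends, so the trailing arguments do not have to be trimmed by hand. This is a different technique from the split-and-regex approach above, not a copy of it. The helper name extract_first_json_arg and the sample script text are made up for illustration.

```python
import json

def extract_first_json_arg(script_text, call="DataModel.parse("):
    """Return the first JSON argument passed to `call` inside script_text."""
    start = script_text.index(call) + len(call)
    # raw_decode needs the JSON value to begin exactly at `start`,
    # so skip any whitespace after the opening parenthesis.
    while script_text[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(script_text, start)
    return value

# Made-up script text, for illustration only (not the real page content).
sample = 'var m = DataModel.parse({"estimate": {"eps": 1.4, "revenue": 23972}}, {"parse": true});'
data = extract_first_json_arg(sample)
print(data["estimate"]["eps"])
```

If the argument turns out to be a JavaScript object literal rather than strict JSON, raw_decode will raise a ValueError and the manual cleanup is still needed.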

From there, you can reshape the data into the form you want in the CSV.
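
One possible way to do that reshaping, sketched with csv.DictWriter from the standard library. The rows below are invented, and the key names (userName, revenue, eps) are an assumption about what the parsed data looks like; adjust them to whatever the dictionary actually contains.

```python
import csv
import io

# Made-up rows shaped like the parsed data; the real key names may differ.
rows = [
    {"userName": "alice", "revenue": 23972, "eps": 1.4, "rank": 1},
    {"userName": "bob", "revenue": 24100, "eps": 1.38, "rank": 2},
]

def rows_to_csv(rows, columns):
    """Render the chosen keys of each dict as CSV text, header row first."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not listed in `columns`.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows, ["userName", "revenue", "eps"])
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch easy to try out; to produce a file, pass an open file object (opened with newline="") to csv.DictWriter instead.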

Anna Ignashkina · Answer 2 · 2017-04-02T11:27:51+0000

This is how I solved the problem, using a few of the tips above:

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
import json
import csv

url = "https://www.estimize.com/jpm/fq3-2016?sort=rank&direction=asc&estimates_per_page=142&show_confirm=false"
# str() of the soup gives the whole page markup as text
page = str(BeautifulSoup(urlopen(url).read(), "html.parser"))

# the estimates are embedded as a JSON array between these two keys
data_string = page.split("\"allEstimateRows\":")[1]
data_string = data_string.split(",\"tableSortDirection")[0]
data = json.loads(data_string)

# the with block closes the file once all rows are written
with open("estimize.csv", "a", newline="") as out:
    f = csv.writer(out)
    f.writerow(["User Name", "Revenue Estimate", "EPS Estimate"])
    for item in data:
        f.writerow([item["userName"], item["revenue"], item["eps"]])
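
The two split calls in this answer raise a bare IndexError if the page layout changes and a marker goes missing. A small helper can make that failure easier to diagnose. This is only a sketch: the name slice_between and the page fragment are made up for illustration, and the markers are the ones used in the code above.

```python
import json

def slice_between(text, start_marker, end_marker):
    """Return the text between start_marker and end_marker, or raise ValueError."""
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        raise ValueError("start marker not found: %r" % start_marker)
    body, found_end, _ = rest.partition(end_marker)
    if not found_end:
        raise ValueError("end marker not found: %r" % end_marker)
    return body

# Made-up page fragment, for illustration only.
page = '{"allEstimateRows":[{"userName":"alice","revenue":23972,"eps":1.4}],"tableSortDirection":"asc"}'
rows = json.loads(slice_between(page, '"allEstimateRows":', ',"tableSortDirection'))
print(rows[0]["userName"])
```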
