Scraping data from several previous dates on the WSJ stock website

Question

I am scraping data from the WSJ Biggest Gainers site. I'm new to Python, so I'm pretty sure the answer is simple; I just cannot find a clear answer to this question.

At the moment my code only loads data from one page, but I want it to go back through previous days of data and, for example, use find_all or fetch data from the charts. How do I change the URL in the code to do this? I am using Python 3.4.3 and bs4.

The good news is that the URLs for previous days differ only by a few digits.

For example, this is last Friday:

http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-20150731.html?mod=mdc_pastcalendar

This is last Thursday:

http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-20150730.html?mod=mdc_pastcalendar

Ideally, I would like to be able to change the month, day, or year as needed, and then loop through the different page URLs to get the data I want.

Here is my code:

import requests 
from bs4 import BeautifulSoup


url = 'http://online.wsj.com/mdc/public/page/2_3021-gainnyse-gainer.html'

r = requests.get(url)           #downloads website html

soup = BeautifulSoup(r.content)         #soup calls the data

v_data = soup.select('.text') 

for symbol in v_data:
    print(symbol.text)

I just want to run this code in a loop for the past X days. I tried making a list of URLs to run, with no luck. Building a list of URLs by hand is also a lot of work, so it would be better if I could use something like %s or %d for the month, year and day.

python beautifulsoup

j.doe 08 Aug 15 at 14:43

1 answer

Padraic cunningham · Answer 1 · 2015-08-08T15:10:00+0000

You can start from a date and step back one day at a time with timedelta (start -= timedelta(days=1)), passing each date into the URL with str.format and strftime:

import requests
from bs4 import BeautifulSoup
from datetime import date,timedelta
start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"

start = date.today()
for _ in range(5):
    url = start_url.format(start.strftime("%Y%m%d"))
    start -= timedelta(days=1)
    r = requests.get(url)           #downloads website html
    soup = BeautifulSoup(r.content)         #soup calls the data
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text)
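
The date arithmetic can be checked on its own, without hitting the site. The following is only a sketch: build_urls is a hypothetical helper name, not part of the code above, and the URL template is copied from it.

```python
from datetime import date, timedelta

# URL template copied from the answer; {} is filled with a YYYYMMDD date.
START_URL = ("http://online.wsj.com/mdc/public/page/"
             "2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar")


def build_urls(start, days):
    """Return one URL per calendar day, walking backwards from start.

    build_urls is a hypothetical helper used for illustration only.
    """
    return [
        START_URL.format((start - timedelta(days=offset)).strftime("%Y%m%d"))
        for offset in range(days)
    ]


urls = build_urls(date(2015, 7, 31), 3)
for url in urls:
    print(url)
```

Printing the URLs first makes it easy to confirm the pattern matches the two addresses in the question before sending any requests.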

You can start from any date you like. If you want a specific start date, create a datetime object:

import requests
from bs4 import BeautifulSoup
from datetime import datetime,timedelta
start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"

start = datetime(2015, 7, 31)  # 07 with a leading zero is a SyntaxError in Python 3
for _ in range(5):
    print("Data for {}".format(start.strftime("%b %d %Y")))
    url = start_url.format(start.strftime("%Y%m%d"))
    start -= timedelta(days=1)
    r = requests.get(url)           #downloads website html
    soup = BeautifulSoup(r.content)         #soup calls the data
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text.rstrip())
    print(" ")

Output:

Data for Jul 31 2015

|
What's This?
|
1

MoneyGram International (MGI)
2

YRC Worldwide (YRCW)
3

Immersion (IMMR)
4

Skywest (SKYW)
5

Vital Therapies (VTL)
6

..........................

Data for Jul 30 2015

|
What's This?
|
1

H&E Equipment Services (HEES)
2

Senomyx (SNMX)
3

eHealth (EHTH)
4

Nutrisystem (NTRI)
5

Open Text (OTEX)
6

LivePerson (LPSN)
7

Sonus Networks (SONS)
8

FormFactor (FORM)
9

Pegasystems (PEGA)
10

Town Sports International Holdings (CLUB)
11

FARO Technologies (FARO)
12

Presbia (LENS)
13

If you only want to include weekdays and still get n days, we need to add a little more logic:

import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"

start = datetime(2015, 7, 31)


def only_weekdays_range(start, n):
    i = 0
    wk_days = {0, 1, 2, 3, 4}
    while i != n:
        while start.weekday() not in wk_days:
            start -= timedelta(days=1)
        yield start
        i += 1
        start -= timedelta(days=1)



for dte in only_weekdays_range(start, 2):
    # use the yielded date, not the unchanged outer start
    print("Data for {}".format(dte.strftime("%b %d %Y")))
    url = start_url.format(dte.strftime("%Y%m%d"))
    print(url)
    r = requests.get(url)  #downloads website html
    soup = BeautifulSoup(r.content)  #soup calls the data
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text.rstrip())
    print(" ")

only_weekdays_range will yield n days going back from our start date, excluding weekends. For example, print(list(only_weekdays_range(datetime(2015, 7, 26), 2))) gives [datetime.datetime(2015, 7, 24, 0, 0), datetime.datetime(2015, 7, 23, 0, 0)], which is Friday the 24th and Thursday the 23rd, because our start day is Sunday the 26th.

If you want to exclude holidays as well, that takes a little more work. Another approach would be to decrement n only when v_data actually returns data, but that can lead to infinite loops for various reasons.
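
One way to count only the days that return data while still guaranteeing the loop ends is to cap the number of days tried. This is a rough sketch under assumptions, not a tested scraper: collect_days, fetch, max_attempts and fake_fetch are all hypothetical names, and fake_fetch stands in for the real requests/BeautifulSoup call by pretending that 3 July 2015 has no data.

```python
from datetime import date, timedelta


def collect_days(start, n, fetch, max_attempts=30):
    """Walk back from start until n days have returned data.

    fetch is a hypothetical callable that takes a date and returns a list
    of rows (empty when the page has no data, e.g. on a market holiday).
    max_attempts caps how many calendar days are tried, so the loop cannot
    run forever if the site stops returning data.
    """
    results = {}
    current = start
    for _ in range(max_attempts):
        if len(results) == n:
            break
        if current.weekday() < 5:  # Monday to Friday only
            rows = fetch(current)
            if rows:  # count the day only when data came back
                results[current] = rows
        current -= timedelta(days=1)
    return results


def fake_fetch(day):
    """Stand-in for the real request; assumes 3 July 2015 returns nothing."""
    return [] if day == date(2015, 7, 3) else ["row for {}".format(day)]


found = collect_days(date(2015, 7, 6), 3, fake_fetch)
print(sorted(found, reverse=True))
```

In real use, fetch would wrap the requests.get and soup.select('.text') calls from the code above. If fewer than n days come back, the cap was reached, which is easier to diagnose than a loop that never ends.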
