Scraping multiple URLs using Python, but the data doesn't change when I increment the site page numbers?

I've been messing around with web scraping using requests and BeautifulSoup, and I'm getting some odd results when I try to loop through multiple pages of message board data, incrementing the page number on each loop.

Below is an example where I look at page 1 of the bulletin board and then at page 2. As a test, I print the URL I request and then the first entry found on that page. The URLs look correct, but the first entry is the same for both. Yet if I copy and paste those two URLs into a browser, I definitely see different content on each page.

Can anyone tell me if this is a problem with my code, or if it has something to do with the way the data is structured on this forum? Thanks in advance!

from bs4 import BeautifulSoup
import requests

n_pages = 2
base_link = 'http://tigerboard.com/boards/list.php?board=4&page='

for i in range(1, n_pages + 1):
    link = base_link + str(i)
    html_doc = requests.get(link)
    soup = BeautifulSoup(html_doc.text, "lxml")
    bs_tags = soup.find_all("div", {"class": "msgline"})
    posts = []
    for post in bs_tags:
        posts.append(post.text)
    print(link)
    print(posts[0])

>     http://tigerboard.com/boards/list.php?board=4&page=1
>     52% of all websites are in English, but  - catbirdseat MU - 3/23/17 14:41:06
>     http://tigerboard.com/boards/list.php?board=4&page=2
>     52% of all websites are in English, but  - catbirdseat MU - 3/23/17 14:41:06

      



1 answer


The website's implementation is bogus. For some reason it requires a specific cookie, PHPSESSID, or it will keep returning the first page no matter what page parameter you request.

Setting this cookie fixes the problem:

from bs4 import BeautifulSoup
import requests

n_pages = 2
base_link = 'http://tigerboard.com/boards/list.php?board=4&page='

for i in range(1, n_pages + 1):
    link = base_link + str(i)
    # The value of PHPSESSID doesn't matter; the cookie just has to be present.
    html_doc = requests.get(link, headers={'Cookie': 'PHPSESSID=notimportant'})
    soup = BeautifulSoup(html_doc.text, "lxml")
    bs_tags = soup.find_all("div", {"class": "msgline"})
    posts = []
    for post in bs_tags:
        posts.append(post.text)
    print(link)
    print(posts[0])

      



Another solution would be to use a requests Session: the first request (for the first page) will receive a real cookie from the server, and the session will send it automatically with all subsequent requests.
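As a minimal sketch of why the Session approach works (the URL and cookie name come from the answer above; the cookie is set by hand here instead of by the server's Set-Cookie header, and no actual network request is made):

```python
import requests

# A Session keeps a cookie jar: once PHPSESSID is stored (here set by
# hand; in real use, by the first page's Set-Cookie response header),
# it is attached to every request prepared through the session.
session = requests.Session()
session.cookies.set('PHPSESSID', 'notimportant', domain='tigerboard.com')

req = requests.Request('GET',
                       'http://tigerboard.com/boards/list.php?board=4&page=2')
prepared = session.prepare_request(req)  # no network traffic involved

print(prepared.headers.get('Cookie'))  # → PHPSESSID=notimportant
```

So with `session.get(link)` inside the loop, page 2 and beyond are requested with the session cookie already present, and the server returns the right page.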

It was fun to debug!
