Scraping multiple URLs using Python, but the data doesn't change when I increment the site page numbers?

I've been messing around with web scraping using requests and BeautifulSoup, and I'm getting some odd results when I try to loop through multiple pages of message board data, incrementing the page number on each loop.

Below is an example where I look at page 1 of the bulletin board and then at page 2. As a test, I print the URL I request and then the first entry found on that page. The URLs look correct, but the first entry is the same for both. Yet if I copy and paste those two URLs into a browser, I definitely see different content on each page.

Can anyone tell me if this is a problem with my code, or if it has something to do with the way the data is structured on this forum? Thanks in advance!

from bs4 import BeautifulSoup
import requests

n_pages = 2
base_link = 'http://tigerboard.com/boards/list.php?board=4&page='

for i in range(1, n_pages + 1):
    link = base_link + str(i)
    html_doc = requests.get(link)
    soup = BeautifulSoup(html_doc.text, "lxml")
    bs_tags = soup.find_all("div", {"class": "msgline"})
    posts = []
    for post in bs_tags:
        posts.append(post.text)
    print(link)
    print(posts[0])

>     http://tigerboard.com/boards/list.php?board=4&page=1
>     52% of all websites are in English, but  - catbirdseat MU - 3/23/17 14:41:06
>     http://tigerboard.com/boards/list.php?board=4&page=2
>     52% of all websites are in English, but  - catbirdseat MU - 3/23/17 14:41:06

      



1 answer


The website's implementation is bogus. For some reason it requires a specific cookie, PHPSESSID, or it will keep returning the first page no matter what page parameter you request.

Setting this cookie fixes the problem:

from bs4 import BeautifulSoup
import requests

n_pages = 2
base_link = 'http://tigerboard.com/boards/list.php?board=4&page='

for i in range(1, n_pages + 1):
    link = base_link + str(i)
    # The value of PHPSESSID doesn't matter; the cookie just has to be present.
    html_doc = requests.get(link, headers={'Cookie': 'PHPSESSID=notimportant'})
    soup = BeautifulSoup(html_doc.text, "lxml")
    bs_tags = soup.find_all("div", {"class": "msgline"})
    posts = []
    for post in bs_tags:
        posts.append(post.text)
    print(link)
    print(posts[0])

      



Another solution would be to use a requests Session: the first request (for the first page) will receive a real cookie from the server, and the session will send it automatically with all subsequent requests.
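As a minimal sketch of why the Session approach works (the URL and cookie name come from the answer above; the cookie is set by hand here instead of by the server's Set-Cookie header, and no actual network request is made):

```python
import requests

# A Session keeps a cookie jar: once PHPSESSID is stored (here set by
# hand; in real use, by the first page's Set-Cookie response header),
# it is attached to every request prepared through the session.
session = requests.Session()
session.cookies.set('PHPSESSID', 'notimportant', domain='tigerboard.com')

req = requests.Request('GET',
                       'http://tigerboard.com/boards/list.php?board=4&page=2')
prepared = session.prepare_request(req)  # no network traffic involved

print(prepared.headers.get('Cookie'))  # → PHPSESSID=notimportant
```

So with `session.get(link)` inside the loop, page 2 and beyond are requested with the session cookie already present, and the server returns the right page.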

It was fun to debug!
