Getting past the login screen with Python mechanize

EDIT (10/30): The solution is found at the bottom of this post.

Hello to all,

I am new to "web scraping" scripting and am trying to scrape data from GISIS pages with Python. Although I originally tried to do it with requests, D8Amonk's post on SO led me to mechanize, which worked pretty well for the most part.

I managed to get around the initial 403 errors I was getting by adding the headers found in kumar's post, but I now face the problem of not being able to navigate past the GISIS login screen to the pages that actually hold the data.

Julian Todd's excellent post on ScraperWiki helped me a lot in understanding how to disable annoying form controls and work with the __doPostBack() page mechanism. Unfortunately, the login page still ignores mechanize's attempts at form submission - it doesn't recognize that an authority, username, and password were entered.

Below are my code snippets:

import os
import sys
import webbrowser
import mechanize
import urllib2
import cookielib
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
request = urllib2.Request('https://gisis.imo.org/Public/SHIPS/Default.aspx', None, header)

...

jar = cookielib.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(jar)
browser.set_handle_robots(False)

browser.open(request)
browser.select_form(nr=0)
browser.form.set_all_readonly(False)
browser.form['ctl00$cpMain$ddlAuthorityType'] = ['PUBLIC']
browser.form['ctl00$cpMain$txtUsername'] = username
browser.form['ctl00$cpMain$txtPassword'] = password
browser.find_control('ctl00$cpMain$cbxRemember').selected = False
browser.find_control('ctl00$cpMain$btnRegister').disabled = True
browser["__EVENTTARGET"] = "lnkNext"
browser["__EVENTARGUMENT"] = ""
resp = browser.submit()
print '-- Request Made Successfully --'
return resp.read()

The output of resp.read() is then written to an .html file and opened in Firefox. Commenting out some of the browser.form[...] lines led to an interesting discovery: if the Authority is included in the form submission (in this case "PUBLIC"), then the web page recognizes the Authority but complains that the username and password should be entered.

However, if the Authority line is commented out, then the generated web page recognizes that the username and password have been entered, but prompts for the choice of Authority (in this case, the username field is filled in correctly but the password field is blank; I'm not sure if this is intended behavior). Likewise, as long as the Authority line is still commented out, I can comment out the username or password line in my code, and the web page will then ask for the Authority and for whichever other field was commented out (i.e. if I send only the password, the page asks for the Authority and the username).

Does anyone have any suggestions on what I might be doing wrong, or where else to look? This seems like a rather unusual problem - a Google search did not turn up any similar problems reported by other people.

PS This is my first post on StackOverflow. I tried attaching images to explain the scenarios I described, but apparently I lack the reputation needed to post images. I apologize profusely if I was too verbose or did something wrong, e.g. in formatting my message - please correct me!

EDIT (10/30): Came back to this project after moving on to other things and figured out a solution. Solution below:

This was actually not as difficult to fix as I would have thought. Modifying __EVENTTARGET and __EVENTARGUMENT was not required. Instead, it was necessary to change the __VIEWSTATE and __VIEWSTATEGENERATOR fields. The correct values to use were found by examining successful POST requests in Firebug. Sample code looks like this:

browser.form['__VIEWSTATE'] = 'blablabla'
browser.form['__VIEWSTATEGENERATOR'] = 'blablabla'

      

Modifying both values successfully allows me to enter the main page. Hope this helps someone!
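
For anyone comparing values, here is a minimal sketch of how the hidden fields a page serves could be listed, so they can be checked against what Firebug shows in a successful POST. It uses Python 3's standard html.parser rather than mechanize, and SAMPLE_PAGE with its placeholder values is a hypothetical stand-in for the real login page, not GISIS output.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of hidden form fields found in the given HTML."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical stand-in for a login page; the values are placeholders.
SAMPLE_PAGE = """
<form method="post" action="Default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="served-viewstate" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="served-generator" />
  <input type="text" name="ctl00$cpMain$txtUsername" />
</form>
"""

served = hidden_fields(SAMPLE_PAGE)
for name in sorted(served):
    print(name, "=", served[name])
```

If the values listed this way differ from the ones in the browser's successful request, that difference is a candidate for what the script needs to override.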



1 answer


Thanks for the tip to use Firebug (or Chrome's built-in developer tools) to inspect the content of the request and see which form fields are actually being sent back to the server. I needed to add an extra field, {'SubmitLogin':'Sign In'}, to authenticate with my server.
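
As a rough sketch of that comparison, a POST body copied from the browser's network panel can be parsed and checked against the fields a script sets. This uses Python 3's urllib.parse; the captured body and every field name other than SubmitLogin are made-up examples, and missing_fields is a hypothetical helper, not part of mechanize.

```python
from urllib.parse import parse_qs


def missing_fields(captured_body, script_fields):
    """Return names the browser submitted that the script does not set."""
    browser_fields = parse_qs(captured_body, keep_blank_values=True)
    return sorted(set(browser_fields) - set(script_fields))


# Hypothetical body copied from the browser's network panel.
CAPTURED = "username=alice&password=secret&SubmitLogin=Sign+In"

# Fields the scraping script currently fills in (also hypothetical).
SCRIPT_FIELDS = {"username": "alice", "password": "secret"}

print(missing_fields(CAPTURED, SCRIPT_FIELDS))
```

Any name reported here is one the browser sends and the script omits, which is how a field like SubmitLogin would show up.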


