BeautifulSoup - Scraping a Website with Login and Site Search Engine

I am trying to scrape the International Maritime Organization's data ( https://gisis.imo.org/Public/PAR/Search.aspx ) on attacks on shipping vessels between the dates ("is between" in the site's search engine) 2002-01-01 and 2005-12-31.

Fill in the dates and click add

I have previously used the bs4 and requests Python modules to scrape financial data from Yahoo and weather data from Wunderground, but this site requires a username and password (under the "public" account type). Also, as I said, the data requires a search / filter before I can access the HTML on the page:

As soon as I click on a row here, it expands to show the incident details. (Before anyone asks why I don't just download the dataset and pull from there: the download is filtered for some reason and not all columns are included, e.g. the IMO number.)


The data I am trying to scrape from this page, and what I need (item, CSS path):

  • incident position

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(1) > td:nth-child(2) > span
    
          

  • date

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(6) > td.content > span
    
          

  • ship name

    #ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody > tr:nth-child(4) > td:nth-child(2) > span
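
Once the detail HTML has been fetched, the three paths above could be fed to BeautifulSoup's `select_one`. The following is only a sketch: the sample HTML and its values are made up for illustration, and the `> tbody` fallback is an assumption based on the fact that browsers insert `tbody` into tables even when the raw HTML does not contain it, so a selector copied from the developer tools may not match the downloaded page as-is.

```python
from bs4 import BeautifulSoup

PANEL = "#ctl00_bodyPlaceHolder_ctl00_pnlDetail > table:nth-child(4) > tbody"

# Item name -> CSS path, as listed in the question.
SELECTORS = {
    "incident_position": PANEL + " > tr:nth-child(1) > td:nth-child(2) > span",
    "date": PANEL + " > tr:nth-child(6) > td.content > span",
    "ship_name": PANEL + " > tr:nth-child(4) > td:nth-child(2) > span",
}


def select_text(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    if node is None:
        # Retry without the browser-inserted tbody level.
        node = soup.select_one(selector.replace(" > tbody", ""))
    return node.get_text(strip=True) if node is not None else None


def extract_incident(html):
    """Pull the three fields of interest out of one detail page."""
    soup = BeautifulSoup(html, "html.parser")
    return {name: select_text(soup, css) for name, css in SELECTORS.items()}


# Made-up HTML that only mimics the nesting implied by the selectors.
SAMPLE_HTML = """
<div id="ctl00_bodyPlaceHolder_ctl00_pnlDetail">
  <h2>Incident</h2><p>filler</p><p>filler</p>
  <table>
    <tr><td>Position</td><td><span>In port area</span></td></tr>
    <tr><td></td><td></td></tr>
    <tr><td></td><td></td></tr>
    <tr><td>Ship</td><td><span>EXAMPLE SHIP</span></td></tr>
    <tr><td></td><td></td></tr>
    <tr><td>Date</td><td class="content"><span>2003-05-01</span></td></tr>
  </table>
</div>
"""

record = extract_incident(SAMPLE_HTML)
```

Keeping the selectors in one dictionary makes it easy to adjust them if the real page turns out to be nested differently from what the browser shows.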
    
          

Needless to say, this seems like a daunting task. Any recommendations?

Here is the OLD code I used to scrape the weather data (nothing has changed because I don't know where to start in terms of the login / filtering process): http://pythonfiddle.com/get-wx-data


1 answer


requests won't be enough. You want to look at mechanize: http://wwwsearch.sourceforge.net/mechanize/

The good thing about mechanize is that it maintains state from page to page, as opposed to requests. (Perhaps you could do this with just requests, but I'm not that smart.) Here's an example of a simple login interaction.
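
As a rough sketch of what a simple login can look like, here is a version written with `requests.Session` rather than mechanize; a session object keeps cookies between calls, which is the page-to-page state that matters for a login. The URL and the form field names (`username`, `password`) are hypothetical placeholders, not the IMO site's real ones, and would have to be read from the actual login form.

```python
import requests

# Hypothetical login endpoint, used only for illustration.
LOGIN_URL = "https://example.com/login"


def build_login_payload(username, password):
    """Form fields to submit; the field names here are assumptions."""
    return {"username": username, "password": password}


def login(session, username, password):
    """POST the credentials; any cookies set by the server stay on the session."""
    response = session.post(
        LOGIN_URL,
        data=build_login_payload(username, password),
        timeout=30,
    )
    response.raise_for_status()
    return response


if __name__ == "__main__":
    with requests.Session() as session:
        login(session, "my-user", "my-password")
        # Later requests on the same session send the login cookies back.
        page = session.get("https://example.com/members", timeout=30)
        print(page.status_code)
```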



It would be great if the IMO site were that easy. Instead, it is ASP-based, which means it is relatively annoying to scrape. Some of the details will vary from site to site, so I suggest two things in particular: take a look at the Network tab of your browser's developer tools, and read this ScraperWiki report on working with ASP sites.
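
What usually makes ASP.NET pages (note the `.aspx` in the search URL) awkward is that each form carries hidden state fields such as `__VIEWSTATE` and `__EVENTVALIDATION`, and the server generally expects them to be posted back together with the search values. A minimal sketch of building such a postback payload follows; the control name `btnAdd` and the date field names are hypothetical and should be replaced with whatever the Network tab shows for the real request.

```python
from bs4 import BeautifulSoup


def collect_hidden_fields(html):
    """Return {name: value} for every hidden <input> on the page."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all("input", attrs={"type": "hidden"}):
        name = tag.get("name")
        if name:
            fields[name] = tag.get("value", "")
    return fields


def build_postback(html, event_target, form_values):
    """Combine the page's hidden state with the values we want to submit."""
    payload = collect_hidden_fields(html)
    payload["__EVENTTARGET"] = event_target  # the control that "was clicked"
    payload.update(form_values)
    return payload


# Made-up form that only imitates the shape of an ASP.NET page.
SAMPLE_FORM = """
<form method="post" action="Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="dateFrom" />
  <input type="text" name="dateTo" />
</form>
"""

payload = build_postback(
    SAMPLE_FORM,
    event_target="btnAdd",  # hypothetical control name
    form_values={"dateFrom": "2002-01-01", "dateTo": "2005-12-31"},  # hypothetical field names
)
```

The resulting dictionary would then be sent as the form data of a POST made from the same logged-in session, and the hidden fields re-read from each response before the next postback.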

Good luck!
