Node.js scraping with chrome-remote-interface

I am trying to scrape a site protected by Distil Networks, where using Selenium (with Python) always fails.

I did some searching and came to the conclusion that the site can detect Selenium using some kind of JavaScript. I then took a look at chrome-remote-interface, since it seems to be what I want, but then I got stuck.

What I would like to do is automate the following steps:

  • Open a Chrome instance
  • Go to the page
  • Run some JavaScript
  • Collect the data and save it to a file
  • Repeat steps 2-4

I know that I can open a Chrome instance for debugging:

google-chrome --remote-debugging-port=9222

      

And I can open a console from Node.js with:

chrome-remote-interface -t 127.0.0.1 -p 9222 inspect -r

      

I can also run simple scripts like

Page.navigate({url:"https://google.com"})
Runtime.evaluate({expression:"1+1"})

      

But how can I access the DOM directly from Node.js, the way I can in the Chrome Developer Tools console? Basically, I want to run scripts from Node.js as I would in the DevTools console.

In addition, there is not enough documentation on using chrome-remote-interface for scraping. Are there any good links for this?

1 answer


The JavaScript expressions evaluated with Runtime.evaluate are executed in the context of the page, just like what happens in the DevTools console.
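As a rough illustration, a Node.js script could wrap Runtime.evaluate as below. This is only a sketch: `evalInPage` and `main` are hypothetical helper names, and it assumes Chrome was started with `--remote-debugging-port=9222` as in the question and that the `chrome-remote-interface` package is installed.

```javascript
// Evaluate `expression` inside the page and return its value to Node.js.
// `client` is a connected chrome-remote-interface client.
async function evalInPage(client, expression) {
  const { result, exceptionDetails } = await client.Runtime.evaluate({
    expression,
    returnByValue: true, // ask for the serialized value, not a remote object handle
    awaitPromise: true,  // if the expression yields a promise, wait for it
  });
  if (exceptionDetails) {
    throw new Error(exceptionDetails.text || 'evaluation failed in page');
  }
  return result.value;
}

// Hypothetical entry point; needs a running Chrome with remote debugging enabled.
async function main() {
  const CDP = require('chrome-remote-interface'); // loaded lazily, only when main() runs
  const client = await CDP({ host: '127.0.0.1', port: 9222 });
  try {
    const { Page } = client;
    await Page.enable();
    const loaded = Page.loadEventFired(); // subscribe before navigating
    await Page.navigate({ url: 'https://google.com' });
    await loaded;
    console.log(await evalInPage(client, 'document.title'));
  } finally {
    await client.close();
  }
}

// To try it against a live Chrome: main().catch(console.error);
```

Any expression that works in the DevTools console, such as `document.querySelectorAll('a').length`, should be usable here as long as its result can be serialized.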

You can interact with the DOM using the DOM domain, for example DOM.getDocument, DOM.querySelector, etc.
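A minimal sketch of the DOM-domain route, assuming an already connected client. The helper name `outerHtmlOf` is made up for illustration, and the check for a missing match relies on DOM.querySelector reporting a node id of 0, which is my understanding of the protocol rather than something stated here.

```javascript
// Return the outer HTML of the first node matching `selector`, or null.
// `client` is a connected chrome-remote-interface client.
async function outerHtmlOf(client, selector) {
  const { DOM } = client;
  // The document root is the starting point for selector queries.
  const { root } = await DOM.getDocument();
  const { nodeId } = await DOM.querySelector({ nodeId: root.nodeId, selector });
  if (!nodeId) {
    return null; // assumption: a node id of 0 means nothing matched
  }
  const { outerHTML } = await DOM.getOuterHTML({ nodeId });
  return outerHTML;
}
```

For simple extraction, evaluating `document.querySelector(...)` through Runtime.evaluate is often shorter; the DOM domain is more useful when you need node ids for other protocol calls.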

Also remember that chrome-remote-interface is basically a library, meaning that it allows you to create your own Node.js applications; chrome-remote-interface inspect is just a utility.
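Used as a library, the steps listed in the question could be sketched roughly as follows. The names `scrapeAll`, `main`, and `out.jsonl` are hypothetical, the output format (one JSON object per line) is just one possible choice, and the script assumes Chrome is already running with `--remote-debugging-port=9222`.

```javascript
const fs = require('fs');

// For each URL: navigate, wait for the load event, evaluate `expression`
// in the page, and append the result to `outFile` as one JSON line.
async function scrapeAll(client, urls, expression, outFile) {
  const { Page, Runtime } = client;
  await Page.enable(); // required to receive Page events
  for (const url of urls) {
    const loaded = Page.loadEventFired(); // subscribe before navigating
    await Page.navigate({ url });
    await loaded;
    const { result } = await Runtime.evaluate({ expression, returnByValue: true });
    fs.appendFileSync(outFile, JSON.stringify({ url, data: result.value }) + '\n');
  }
}

// Hypothetical entry point; connects to the Chrome instance from the question.
async function main() {
  const CDP = require('chrome-remote-interface'); // loaded lazily, only when main() runs
  const client = await CDP({ port: 9222 });
  try {
    await scrapeAll(client, ['https://google.com'], 'document.title', 'out.jsonl');
  } finally {
    await client.close();
  }
}

// To try it against a live Chrome: main().catch(console.error);
```

Waiting for the load event is a simplification; pages that render their content later with JavaScript may need an extra wait inside the evaluated expression.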

There are several places where you can get help:



If you ask something more specific, I would be happy to help you with that.

Finally, you can take a look at automated-chrome-profiling, which I think is structurally similar to what you are trying to achieve.
