Scraping web pages with PhantomJS

Is there a way to execute all the JavaScript on a web page exactly as a browser would, without specifying which function to execute? Most of the examples I've seen indicate which piece of JavaScript to execute on the scraped web page. I need to scrape all the content, execute all of its scripts like a browser does, and get the final code, the same code that can be seen with Google validation.

I'm sure there must be some way, but the example code from PhantomJS doesn't seem to have an example addressing this.



2 answers


You don't specify what gets executed on the page when using PhantomJS. You open a page with PhantomJS, and all the JavaScript that would run in Chrome or Firefox is executed in PhantomJS as well. It is a full headless browser.

There are some differences. Clicking a download link will not trigger a download. The rendering engine that PhantomJS 1.x is based on is almost 4 years old, so some pages render differently because PhantomJS 1.x may not support a feature they rely on. (PhantomJS 2 is on the way and is now in unofficial "alpha" status.)



So you need to script every interaction a user would make on the page, using JavaScript or CoffeeScript. You are not calling the page's functions; you are manipulating DOM elements to simulate a user interacting with the page in the browser. It has to be done at this low level because the PhantomJS API does not provide high-level convenience functions. If you want those, have a look at CasperJS, which is built on top of PhantomJS / SlimerJS.

There you have features such as click, wait, fetchText, etc.
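Without CasperJS, the manual approach described above might look roughly like the following. This is only a sketch under some assumptions: the helper name simulateClick, the selector passed as a second command-line argument, and the one-second delay are all made up for illustration. The helper builds a synthetic click event by hand rather than relying on a click() method being available on every element, and the typeof phantom guard only exists so that the helper can be loaded outside PhantomJS.

```javascript
// Runs inside the page (via page.evaluate), so it can only use its
// arguments and page globals such as `document` and `window`.
function simulateClick(selector) {
    var el = document.querySelector(selector);
    if (!el) {
        return false; // nothing matched, nothing clicked
    }
    // Build and dispatch a synthetic left-button click event.
    var ev = document.createEvent("MouseEvents");
    ev.initMouseEvent("click", true, true, window, 1, 0, 0, 0, 0,
                      false, false, false, false, 0, null);
    el.dispatchEvent(ev);
    return true;
}

// Only drive a page when actually running under PhantomJS.
if (typeof phantom !== "undefined") {
    var system = require("system"),
        page = require("webpage").create(),
        url = system.args[1],
        selector = system.args[2]; // hypothetical, e.g. "#load-more"

    page.open(url, function (status) {
        if (status !== "success") {
            console.log("Failed to load " + url);
            phantom.exit(1);
            return;
        }
        var clicked = page.evaluate(simulateClick, selector);
        console.log("clicked: " + clicked);
        // Arbitrary delay so the page's handlers can update the DOM.
        setTimeout(function () {
            console.log(page.content);
            phantom.exit(0);
        }, 1000);
    });
}
```

It would be run as something like: phantomjs click.js http://your.url.to.scrape.com "#some-selector". A fixed delay is the crudest way to wait for the page to react; with CasperJS, click and wait take the place of this plumbing.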



This will work: put the following in a file called "scrape.js" and run it with phantomjs, passing your URL as the first argument.



// Usage: phantomjs scrape.js http://your.url.to.scrape.com
"use strict";
var sys = require("system"),
    page = require("webpage").create(),
    url = sys.args[1];

page.onLoadFinished = function (status) {
    if (status !== "success") {
        console.log("Failed to load " + url);
        phantom.exit(1);
        return;
    }
    // Scripts that run during page load have executed by now, so this
    // is the processed markup rather than the raw source.
    var html = page.evaluate(function () {
        return document.body.innerHTML;
    });
    console.log(html);
    phantom.exit(0);
};

if (url) {
    page.open(url);
} else {
    console.log("Usage: phantomjs scrape.js <url>");
    phantom.exit(1);
}
      







