@coya/web-scraper

Web scraper on top of PhantomJS or Chromium


Web Scraper

Web scraper on top of PhantomJS or Chromium.
If you choose PhantomJS, the module works as a client/server pair: a PhantomJS web scraper server, and a client that acts as a driver and sends scraping requests to the server over HTTP.
Chromium works differently because it is driven directly from Node.js.

Installation

npm install @coya/web-scraper

Build (for dev)

git clone https://github.com/Cooya/WebScraper
cd WebScraper
npm install # also installs the development dependencies
npm install phantomjs -g # only if you need PhantomJS; installs it globally
npm run build
npm run example # run the example script in the "examples" folder

Usage examples

The package lets you inject a JavaScript function into the requested page:

const { ChromiumScraper } = require('@coya/web-scraper');

// if you want to use PhantomJS instead of Chromium
// const { PhantomScraper } = require('@coya/web-scraper');

const scraper = ChromiumScraper.getInstance();

const getLinks = function() { // return all links from the requested page
    return $('a').map(function(i, elt) {
        return $(elt).attr('href');
    }).get();
};

scraper.request({
    url: 'cooya.fr',
    fct: getLinks // function injected in the page environment
})
.then(function(result) {
    console.log(result); // returned value of the injected function
    scraper.close(); // end the client/server connection and kill the web scraper subprocess
}, function(error) {
    console.error(error);
    scraper.close();
});

You can also inject a function exported by an external script:

const { ChromiumScraper } = require('@coya/web-scraper');

// if you want to use PhantomJS instead of Chromium
// const { PhantomScraper } = require('@coya/web-scraper');

const scraper = ChromiumScraper.getInstance();

scraper.request({
    url: 'cooya.fr',
    fct: __dirname + '/externalScript.js', // external script exporting the function to be injected
})
.then(function(result) {
    console.log(result); // returned value of the injected function
    scraper.close(); // end the client/server connection and kill the web scraper subprocess
}, function(error) {
    console.error(error);
    scraper.close();
});

externalScript.js:

module.exports = function() { // return all links from the requested page
    return $('a').map(function(i, elt) {
        return $(elt).attr('href');
    }).get();
};

Methods

ScraperClient.getInstance()

The ScraperClient object is a singleton: only one client can be created, so you must call this method to get the client instance.
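As an illustration of what the singleton behavior means in practice, here is a minimal sketch of the pattern. `ExampleClient` is a hypothetical stand-in written for this example; it is not the package's actual implementation, which may differ.

```javascript
// Minimal singleton sketch. ExampleClient is a hypothetical class,
// not the real ScraperClient from @coya/web-scraper.
class ExampleClient {
    static getInstance() {
        // Create the instance on first call, then reuse it.
        if (!ExampleClient.instance) {
            ExampleClient.instance = new ExampleClient();
        }
        return ExampleClient.instance;
    }
}

const first = ExampleClient.getInstance();
const second = ExampleClient.getInstance();
console.log(first === second); // prints true: both names refer to the same object
```

With this pattern, every part of a script that calls `getInstance()` shares the same client, so closing it in one place closes it everywhere.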

request(params)

Sends a request to the given URL and injects JavaScript into the loaded page. Returns a promise that resolves with the value returned by the injected function.

| Parameter | Type | Description | Default value |
| --- | --- | --- | --- |
| params | object | request parameters, described in "Request parameters spec" below | none |

close()

Terminates the PhantomJS web scraper process so that the current Node.js script can exit properly.
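Because request() returns a promise and close() should be called whether the request succeeds or fails, the two can be combined with async/await and try/finally. The sketch below is only an illustration: `stubScraper` and `scrapeLinks` are hypothetical names, and the stub returns made-up data so the code runs without PhantomJS or Chromium.

```javascript
// Hypothetical stand-in for a scraper instance; it mimics request() and close()
// so this sketch can run without a browser. The returned links are made up.
const stubScraper = {
    closed: false,
    request(params) {
        if (!params || !params.url) {
            return Promise.reject(new Error('missing url'));
        }
        return Promise.resolve(['/about', '/contact']);
    },
    close() {
        this.closed = true;
    }
};

// Hypothetical helper: runs one request and always closes the scraper afterwards.
async function scrapeLinks(scraper, url) {
    try {
        return await scraper.request({
            url: url,
            fct: function() { return []; } // placeholder for the injected function
        });
    } finally {
        scraper.close(); // runs on success and on failure
    }
}

scrapeLinks(stubScraper, 'cooya.fr').then(console.log, console.error);
```

In a real script, the scraper instance obtained from getInstance() would take the place of the stub.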

Request parameters spec

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| url | string | target URL | yes |
| fct | function | JS function to inject into the page | yes (one of the two forms) |
| fct | string | path to a script and name of the function to inject, separated by a hash character (e.g. "path/to/script/script.js#functionToCall") | yes (one of the two forms) |
| referer | string | Referer header set on each request | optional |
| args | object | object passed to the injected function | optional |
| debug | boolean | enables debug mode (verbose output) | optional |
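To make the parameter rules above concrete, here is a small sketch that splits the string form of `fct` and checks a params object against the table. Both `parseFctString` and `checkRequestParams` are hypothetical helpers written for this example; they are not part of the package, and the package's own parsing and validation may behave differently.

```javascript
// Hypothetical helpers, not part of @coya/web-scraper.

// Splits "path/to/script.js#functionToCall" into a script path and a function name.
// Without a hash, the script is assumed to export the function directly.
function parseFctString(fct) {
    const hashIndex = fct.lastIndexOf('#');
    if (hashIndex === -1) {
        return { scriptPath: fct, functionName: null };
    }
    return {
        scriptPath: fct.slice(0, hashIndex),
        functionName: fct.slice(hashIndex + 1)
    };
}

// Returns a list of problems found in a request params object.
function checkRequestParams(params) {
    const problems = [];
    if (typeof params.url !== 'string' || params.url === '') {
        problems.push('url must be a non-empty string');
    }
    if (typeof params.fct !== 'function' && typeof params.fct !== 'string') {
        problems.push('fct must be a function or a string');
    }
    if (params.referer !== undefined && typeof params.referer !== 'string') {
        problems.push('referer must be a string');
    }
    if (params.args !== undefined && (typeof params.args !== 'object' || params.args === null)) {
        problems.push('args must be an object');
    }
    if (params.debug !== undefined && typeof params.debug !== 'boolean') {
        problems.push('debug must be a boolean');
    }
    return problems;
}

console.log(parseFctString('path/to/script/script.js#functionToCall'));
console.log(checkRequestParams({ url: 'cooya.fr', fct: 42 }));
```

Checking the params before calling request() surfaces mistakes such as a missing url early, before a browser process is involved.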

If you find a bug or have a feature request, please open an issue on GitHub!
