A simple web scraper for Node.js using promises and CSS selectors.
How it works
All you need to do is define a JSON object with:
- url: the web page url to be scraped (mandatory)
- forEach: the html element(s) where the scraper should search (optional). If not specified, html will be used.
- get: an object specifying what data we want to get back from each forEach element.
Both the forEach element and the get values must be CSS selectors.
Briefly: for each forEach element found in the web DOM, web-scraper will return an object with the same structure as the get param containing the data in the corresponding property. Let's see some examples.
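As a rough illustration of the rules above, the sketch below checks a search object before it is handed to the scraper. `normalizeSearch` is a hypothetical helper written for this explanation, not part of web-scraper's API, and the URL is only a placeholder.

```javascript
// Hypothetical helper (not part of web-scraper): checks the shape of a
// search object and fills in the documented default for forEach.
function normalizeSearch(search) {
  if (!search || typeof search.url !== 'string' || search.url === '') {
    throw new Error('url is mandatory');
  }
  if (!search.get || typeof search.get !== 'object') {
    throw new Error('get must be an object of CSS selectors');
  }
  for (const [key, selector] of Object.entries(search.get)) {
    if (typeof selector !== 'string') {
      throw new Error(`get.${key} must be a CSS selector string`);
    }
  }
  // forEach is optional; when omitted, the whole html element is searched.
  return { ...search, forEach: search.forEach || 'html' };
}

const mySearch = normalizeSearch({
  url: 'https://example.com/some-page', // placeholder URL
  get: { title: 'h1' },
});
console.log(mySearch.forEach); // prints "html"
```

Validating up front like this is optional; it simply surfaces a missing url or a malformed get before any request is made.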
How to use it
Since web-scraper returns a Promise, given a correct search object (see examples below) you can choose between:
const scraper = ...;
var mySearch = ...;
...;
Cleaner option, but remember that you can only use await inside an async function:
const scraper = ...;
var mySearch = ...;
async ... {
  try {
    let data = await ...;
    // do whatever you need with data
  } catch (err) {
    // Some error occurred. Handle it!
  }
}
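Both styles can be tried without touching the network. In the sketch below, `fakeScrape` is a hypothetical stand-in that just resolves with canned data; it only illustrates how a Promise-returning function is consumed and says nothing about web-scraper's real call signature.

```javascript
// Hypothetical stand-in for the scraper: it returns a Promise, as web-scraper does.
function fakeScrape(search) {
  return search && search.url
    ? Promise.resolve({ filmTitle: 'placeholder title' })
    : Promise.reject(new Error('url is mandatory'));
}

const search = { url: 'https://example.com', get: { filmTitle: 'h1' } }; // placeholder values

// Option 1: promise chaining.
fakeScrape(search)
  .then((data) => console.log('then:', data.filmTitle))
  .catch((err) => console.error('Scrape failed:', err.message));

// Option 2: async/await, which must live inside an async function.
async function run() {
  try {
    const data = await fakeScrape(search);
    return data;
  } catch (err) {
    console.error('Scrape failed:', err.message);
    return null;
  }
}

run().then((data) => console.log('await:', data && data.filmTitle));
```

The two options are equivalent; the async/await form mostly pays off when several scrapes have to run one after another.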
In these examples, we fetch some data from The Matrix page at IMDb.
1. Get the film name, available in the web title
Our search params will be:
var mySearch = {
  url: '',
  get: {
    filmTitle: 'div#ratingWidget p strong'
  }
};
This will return:
Notice that the result will be an array only if the search result contains more than one element, so you need to check whether the result is a single element or a set of them. We'll see this in some of the following examples, but let's continue with The Matrix.
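One way to avoid repeating that check everywhere is to normalize the result once. The sketch below uses a hypothetical `toArray` helper (not provided by web-scraper) and made-up sample objects.

```javascript
// Hypothetical helper: always hand back an array, whatever the scraper returned.
function toArray(result) {
  if (result === undefined || result === null) return [];
  return Array.isArray(result) ? result : [result];
}

// A single object becomes a one-element array; an array is returned untouched.
console.log(toArray({ filmTitle: 'x' }).length); // prints 1
console.log(toArray([{ name: 'a' }, { name: 'b' }]).length); // prints 2
```

After this step the calling code can always iterate, regardless of how many elements matched.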
2. Get the film name with different search params.
For every search, we can find several ways to get the desired data. In this case, we can get the film title with this configuration:
var mySearch = {
  url: '',
  get: {
    filmTitle: 'strong'
  },
  forEach: 'div#ratingWidget p'
};
With this configuration we ask our scraper to search for a strong element inside every p that is a descendant of div#ratingWidget. In the inspected DOM this occurs only once, so again we will get:
3. Get all the cast inside a single object.
Since there is a div identified with titleCast containing all the cast, we can get all the names with:
var mySearch = {
  url: '',
  get: {
    names: 'span[itemprop="name"]'
  },
  forEach: 'div#titleCast'
};
In this case, as there is just one element identified with 'div#titleCast', we are getting just one object containing an array of names, one for each actor/actress:
4. Get one object for each actor and actress, containing the character's name too.
What if we wanted to get a set of objects, each containing the name of the actor/actress and the name of the character played?
var mySearch = {
  url: '',
  get: {
    name: 'span[itemprop="name"]',
    character: 'td.character'
  },
  forEach: 'div#titleCast table tbody tr.even, div#titleCast table tbody tr.odd'
};
This will return:
5. Gathering links
Links ("a" elements) are a special case because they carry two pieces of data we may want to store: the href attribute and the anchor text. That is why, when we ask our scraper to gather "a" elements, it returns both. Here is an example where we are interested in a set of links:
var mySearch = {
  url: '',
  get: {
    linkToPerson: 'td[itemprop="actor"] a'
  },
  forEach: 'div#titleCast table tbody tr'
};
This will return:
6. Grouping data
Sometimes we may need to group the desired data. Let's see an example: in this case we are gathering players from a random NBA game.
We could make a request to our scraper with these simple params:
var mySearch = {
  url: '',
  get: {
    playerName: 'th[csk]',
    points: 'td[data-stat="pts"]'
  },
  forEach: 'table#box_atl_basic tbody tr, table#box_cha_basic tbody tr'
};
This is what we get:
Yes, this way we get all the players with their respective points, but which team did they play for?
To get that info grouped by team, as each roster has its own table, we just need to pass an array instead of a comma-separated list of elements. In this example, note the change in the 'forEach' field:
var mySearch = {
  url: '',
  get: {
    playerName: 'th[csk]',
    points: 'td[data-stat="pts"]'
  },
  forEach: ['table#box_atl_basic tbody tr', 'table#box_cha_basic tbody tr']
};
Now our scraper will search each of the elements (in this case, tables) separately for its players and points, so in the end we get the stats separated by team, where each array position contains one team's players and stats:
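Once the data is grouped this way, per-team totals are easy to compute. The sketch below assumes the result is an array with one inner array per team, each holding objects shaped like the get param; the sample values are made up for illustration, and the points are assumed to arrive as strings.

```javascript
// Made-up sample data shaped like a grouped result: one inner array per team.
const grouped = [
  [{ playerName: 'Player A', points: '20' }, { playerName: 'Player B', points: '11' }],
  [{ playerName: 'Player C', points: '17' }, { playerName: 'Player D', points: '' }],
];

// Sum the points of each team; empty or non-numeric cells count as 0.
function pointsPerTeam(teams) {
  return teams.map((players) =>
    players.reduce((total, p) => total + (Number.parseInt(p.points, 10) || 0), 0)
  );
}

console.log(pointsPerTeam(grouped)); // prints [ 31, 17 ]
```

With the comma-separated forEach, the same players would arrive in one flat list and the team boundary would be lost.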
Check out the test folder to find more examples.
You can run the tests by executing
from the console (after setting your current directory to the project root). After the tests run, a coverage report is available both in the terminal and in the coverage folder (created automatically), which contains an HTML report (simply double-click index.html to open the more detailed report).
Installation is quite easy using npm:
npm i @josedonas/web-scraper
You can get some extra info about this module at the web-scraper npm web page.
- Jose Antonio González Doñas - LinkedIn
This project is licensed under the Apache 2.0 License - see the LICENSE.md file for details