article-parser

Extract main article, main image and meta data from URL.


Intro


article-parser is part of a tool set for content builders:

- feed-reader: extract & normalize RSS/ATOM/JSON feed
- article-parser: extract main article from given URL
- oembed-parser: extract oEmbed data from supported providers

You can use one of these tools, or combine them, to build news sites, create automated content systems for marketing campaigns, or gather datasets for NLP projects.

  ```
                                ┌──▶ article-parser ──┐
  feed-reader ──▶ feed entries ─┤                     ├──▶ content database ──▶ public APIs
                                └──▶ oembed-parser ───┘
  ```

Demo




Install & Usage


Node.js


  ``` sh
  npm i article-parser

  # pnpm
  pnpm i article-parser

  # yarn
  yarn add article-parser
  ```

  ```ts
  // es6 module
  import { extract } from 'article-parser'

  // CommonJS
  const { extract } = require('article-parser')

  // or specify the exact path to the CommonJS variant
  const { extract } = require('article-parser/dist/cjs/article-parser.js')
  ```

Deno


  ```ts
  import { extract } from 'https://esm.sh/article-parser'
  ```

Browser


  ```ts
  import { extract } from 'https://unpkg.com/article-parser@latest/dist/article-parser.esm.js'
  ```

Please check the examples for reference.


Deta cloud


For Deta devs, please refer to the source code and guideline here, or simply use the Deploy button.


APIs


- [transformation object](#transformation-object)
- [sanitize-html's options](#sanitize-htmls-options)


extract()


Loads and extracts article data. Returns a Promise object.

Syntax


  ```ts
  extract(String input)
  extract(String input, Object parserOptions)
  extract(String input, Object parserOptions, Object fetchOptions)
  ```

Parameters


input required

A URL string that links to the article, or the HTML content of that web page.

For example:

  ``` js
  import { extract } from 'article-parser'

  const input = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
  extract(input)
    .then(article => console.log(article))
    .catch(err => console.error(err))
  ```
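
Since input can also be raw HTML, the same call works with a page's markup directly. A minimal sketch of that variant (the snippet below is only illustrative and may be too short to pass the default content length threshold):

  ``` js
  import { extract } from 'article-parser'

  // illustrative HTML snippet; a real page would normally carry much more content
  const html = `
    <html>
      <head><title>Example article</title></head>
      <body>
        <article>
          <h1>Example article</h1>
          <p>Some sufficiently long article content goes here...</p>
        </article>
      </body>
    </html>
  `

  extract(html)
    .then(article => console.log(article))
    .catch(err => console.error(err))
  ```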

The result - article - can be null or an object with the following structure:

  ```ts
  {
    url: String,
    title: String,
    description: String,
    image: String,
    author: String,
    content: String,
    published: Date String,
    source: String, // original publisher
    links: Array, // list of alternative links
    ttr: Number, // time to read in seconds, 0 = unknown
  }
  ```

Click here to see an actual result.


parserOptions optional

Object with all or several of the following properties:

  - wordsPerMinute: Number, used to estimate time to read. Default 300.
  - descriptionTruncateLen: Number, maximum number of characters generated for the description. Default 210.
  - descriptionLengthThreshold: Number, minimum number of characters required for the description. Default 180.
  - contentLengthThreshold: Number, minimum number of characters required for the content. Default 200.

For example:

  ``` js
  import { extract } from 'article-parser'

  extract('https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html', {
    descriptionLengthThreshold: 120,
    contentLengthThreshold: 500
  })
  ```
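
A small sketch of how a custom wordsPerMinute pairs with the ttr field returned in the result (ttr is reported in seconds, with 0 meaning unknown):

  ``` js
  import { extract } from 'article-parser'

  const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

  // assume a slower reading speed than the default 300 words per minute
  extract(url, { wordsPerMinute: 250 })
    .then((article) => {
      if (article && article.ttr > 0) {
        console.log(`Estimated reading time: ${Math.round(article.ttr / 60)} minutes`)
      }
    })
    .catch(err => console.error(err))
  ```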

fetchOptions optional

You can use this param to set the request headers used when fetching.

For example:

  ``` js
  import { extract } from 'article-parser'

  const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'
  extract(url, null, {
    headers: {
      'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
    }
  })
  ```

You can also specify a proxy endpoint to load remote content, instead of fetching directly.

For example:

  ``` js
  import { extract } from 'article-parser'

  const url = 'https://www.cnbc.com/2022/09/21/what-another-major-rate-hike-by-the-federal-reserve-means-to-you.html'

  extract(url, null, {
    headers: {
      'user-agent': 'Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1'
    },
    proxy: {
      target: 'https://your-secret-proxy.io/loadXml?url=',
      headers: {
        'Proxy-Authorization': 'Bearer YWxhZGRpbjpvcGVuc2VzYW1l...'
      }
    }
  })
  ```

Passing requests through a proxy is useful when running article-parser in the browser. See examples/browser-article-parser for a reference example.

For more info about proxy authentication, please refer to HTTP authentication.

For deeper customization, you can consider using Proxy to replace fetch behaviors with your own handlers.


Transformations


Sometimes the default extraction algorithm does not work well. That is when we need transformations.

By adding functions before and after the main extraction step, we aim to get as good a result as possible.

There are two methods for working with transformations:

- addTransformations(Object transformation | Array transformations)
- removeTransformations(Array patterns)

First, let's talk about the transformation object.

transformation object


In article-parser, a transformation is an object with the following properties:

- patterns: required, a list of regexps to match the URLs
- pre: optional, a function to process raw HTML
- post: optional, a function to process extracted article

Basically, the meaning of a transformation can be interpreted like this:

> with the URLs which match these patterns,
> run the pre function to normalize the HTML content,
> then extract the main article content from the normalized HTML, and if successful,
> run the post function to normalize the extracted article content


(Figure: article-parser extraction process)

Here is an example transformation:

  ``` js
  {
    patterns: [
      /([\w]+.)?domain.tld\/*/,
      /domain.tld\/articles\/*/
    ],
    pre: (document) => {
      // remove all .advertise-area elements and their siblings from the raw HTML content
      document.querySelectorAll('.advertise-area').forEach((element) => {
        if (element.nodeName === 'DIV') {
          while (element.nextSibling) {
            element.parentNode.removeChild(element.nextSibling)
          }
          element.parentNode.removeChild(element)
        }
      })
      return document
    },
    post: (document) => {
      // with the extracted article, replace all h4 tags with h2
      document.querySelectorAll('h4').forEach((element) => {
        const h2Element = document.createElement('h2')
        h2Element.innerHTML = element.innerHTML
        element.parentNode.replaceChild(h2Element, element)
      })
      // change small-sized images to the original version
      document.querySelectorAll('img').forEach((element) => {
        const src = element.getAttribute('src')
        if (src.includes('domain.tld/pics/150x120/')) {
          const fullSrc = src.replace('/pics/150x120/', '/pics/original/')
          element.setAttribute('src', fullSrc)
        }
      })
      return document
    }
  }
  ```

To write better transformation logic, please refer to linkedom and the Document object.

addTransformations(Object transformation | Array transformations)


Add a single transformation or a list of transformations. For example:

  ``` js
  import { addTransformations } from 'article-parser'

  addTransformations({
    patterns: [
      /([\w]+.)?abc.tld\/*/
    ],
    pre: (document) => {
      // do something with document
      return document
    },
    post: (document) => {
      // do something with document
      return document
    }
  })

  addTransformations([
    {
      patterns: [
        /([\w]+.)?def.tld\/*/
      ],
      pre: (document) => {
        // do something with document
        return document
      },
      post: (document) => {
        // do something with document
        return document
      }
    },
    {
      patterns: [
        /([\w]+.)?xyz.tld\/*/
      ],
      pre: (document) => {
        // do something with document
        return document
      },
      post: (document) => {
        // do something with document
        return document
      }
    }
  ])
  ```

Transformations without patterns will be ignored.

removeTransformations(Array patterns)


Removes transformations that match the specified patterns.

For example, we can remove all added transformations above:

  ``` js
  import { removeTransformations } from 'article-parser'

  removeTransformations([
    /([\w]+.)?abc.tld\/*/,
    /([\w]+.)?def.tld\/*/,
    /([\w]+.)?xyz.tld\/*/
  ])
  ```

Calling removeTransformations() without a parameter will remove all current transformations.

Priority order


While processing an article, more than one transformation can be applied.

Suppose that we have the following transformations:

  ```ts
  [
    {
      patterns: [
        /http(s?):\/\/google.com\/*/,
        /http(s?):\/\/goo.gl\/*/
      ],
      pre: function_one,
      post: function_two
    },
    {
      patterns: [
        /http(s?):\/\/goo.gl\/*/,
        /http(s?):\/\/google.inc\/*/
      ],
      pre: function_three,
      post: function_four
    }
  ]
  ```

As you can see, an article from goo.gl certainly matches both of them.

In this scenario, article-parser will execute both transformations, one by one:

function_one -> function_three -> extraction -> function_two -> function_four
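
A minimal sketch of how that chaining looks in practice, with placeholder handlers standing in for function_one through function_four and a placeholder goo.gl URL:

  ``` js
  import { addTransformations, extract } from 'article-parser'

  // both transformations match goo.gl URLs, so their handlers chain in registration order
  addTransformations([
    {
      patterns: [/http(s?):\/\/goo\.gl\/*/],
      pre: (document) => { console.log('function_one'); return document },
      post: (document) => { console.log('function_two'); return document }
    },
    {
      patterns: [/http(s?):\/\/goo\.gl\/*/],
      pre: (document) => { console.log('function_three'); return document },
      post: (document) => { console.log('function_four'); return document }
    }
  ])

  // logs function_one, function_three, then (after extraction) function_two, function_four
  extract('https://goo.gl/some-article')
    .then(article => console.log(article))
    .catch(err => console.error(err))
  ```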


sanitize-html's options


article-parser uses sanitize-html to clean up the HTML content.

Here are the default options.

Depending on the needs of your content system, you might want to keep some HTML tags/attributes while ignoring others.

There are two methods to access and modify these options in article-parser:

- getSanitizeHtmlOptions()
- setSanitizeHtmlOptions(Object sanitizeHtmlOptions)

Read sanitize-html docs for more info.
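
A minimal sketch of adjusting these options; the option keys (allowedTags, allowedAttributes) follow sanitize-html's documented format, and the iframe example is only illustrative, so adapt it to your own content policy:

  ``` js
  import { getSanitizeHtmlOptions, setSanitizeHtmlOptions } from 'article-parser'

  // inspect the current options (sanitize-html format)
  const current = getSanitizeHtmlOptions()
  console.log(current.allowedTags)

  // for example, also allow <iframe> tags and keep their src attribute
  setSanitizeHtmlOptions({
    ...current,
    allowedTags: [...(current.allowedTags || []), 'iframe'],
    allowedAttributes: {
      ...current.allowedAttributes,
      iframe: ['src']
    }
  })
  ```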


Quick evaluation


  ``` sh
  git clone https://github.com/ndaidong/article-parser.git
  cd article-parser
  pnpm i

  npm run eval {URL_TO_PARSE_ARTICLE}
  ```

License

The MIT License (MIT)