words-n-numbers

4.0.1 • Public • Published

Words'n'numbers

Extracting arrays of words and optionally numbers and emojis / emoticons from strings. For Node.js and the browser. When you need more than just [a-z]. Part of document processing for search-index and nowsearch.xyz.

Inspired by extractwords

Initiating

Node.js

const wnn = require('words-n-numbers')
// wnn available

Browser

<script src="wnn.js"></script>
<script>
  // wnn available
</script>

Use

The default regex should catch Unicode word characters in every language.

Only words

let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords)
// returns ['A', 'dollars', 'baby']
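
To give a feel for what "only words, any language" means, here is a minimal sketch that does not use the package at all. It assumes a word is simply a run of Unicode letters (`\p{L}`); the helper name `extractWordsSketch` is hypothetical, and the regex actually shipped as `wnn.words` may differ.

```javascript
// Hypothetical helper for illustration, not part of words-n-numbers.
// \p{L} matches a letter in any script; the `u` flag enables property escapes.
function extractWordsSketch (text) {
  // String.prototype.match returns null when nothing matches, so fall back to [].
  return text.match(/\p{L}+/gu) || []
}

console.log(extractWordsSketch('A 1000000 dollars baby!'))
// → [ 'A', 'dollars', 'baby' ]
```

Scripts that rely on combining marks (Devanagari, for example) would need `\p{M}` added to the character class, which this sketch leaves out.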

Only words, converted to lowercase

let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { toLowercase: true })
// returns ['a', 'dollars', 'baby']

Predefined regex for words and numbers, converted to lowercase

let stringOfWords = 'A 1000000 dollars baby!'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbers, toLowercase: true })
// returns ['a', '1000000', 'dollars', 'baby']

Predefined regex for words and emoticons, converted to lowercase

let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsEmojis, toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '👌😄', '😢' ]

Predefined regex for numbers and emoticons

let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.numbersEmojis, toLowercase: true })
// returns [ '2000', '👌😄', '😢' ]

Predefined regex for words, numbers and emoticons, converted to lowercase

let stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
wnn.extract(stringOfWords, { regex: wnn.wordsNumbersEmojis, toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '2000', '👌😄', '😢' ]
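
The combined words, numbers and emojis behavior above can be approximated with plain Unicode property escapes. This is only a sketch: the pattern `wordsNumbersEmojisSketch` and the helper `extractAllSketch` are hypothetical names, and the package's own regexes are likely more thorough about emoji sequences.

```javascript
// Hypothetical pattern, not the regex shipped by words-n-numbers.
// Letters, numbers and pictographic emojis are matched as separate runs.
const wordsNumbersEmojisSketch = /\p{L}+|\p{N}+|\p{Extended_Pictographic}+/gu

function extractAllSketch (text) {
  const tokens = text.match(wordsNumbersEmojisSketch) || []
  // toLowerCase leaves CJK characters, digits and emojis unchanged.
  return tokens.map(token => token.toLowerCase())
}

console.log(extractAllSketch('A ticket to 大阪 costs ¥2000 👌😄 😢'))
// → [ 'a', 'ticket', 'to', '大阪', 'costs', '2000', '👌😄', '😢' ]
```

Adjacent emojis come out as one token because the pattern matches runs. Emoji sequences joined with a zero-width joiner or followed by a variation selector are not handled by this sketch.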

Predefined regex for #tags

let stringOfWords = 'A #49ticket to #大阪 or two#tickets costs ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.tags, toLowercase: true })
// returns [ '#49ticket', '#大阪' ]

Predefined regex for @usernames

let stringOfWords = 'A #ticket to #大阪 costs bob@bob.com, @alice and @美林 ¥2000 👌😄😄 😢'
wnn.extract(stringOfWords, { regex: wnn.usernames, toLowercase: true })
// returns [ '@alice', '@美林' ]

Custom regex

let stringOfWords = 'This happens at 5 o\'clock !!!'
wnn.extract(stringOfWords, { regex: '[a-z\'0-9]+' })
// returns ['This', 'happens', 'at', '5', 'o\'clock']

API

Extract function

Returns an array of words and optionally numbers.

wnn.extract(stringOfText, <options-object>)

Options object

{
  regex: '[custom or predefined regex]',  // defaults to wnn.words
  toLowercase: [true / false]             // defaults to false
}
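
One way to picture how the two options interact is the stand-alone sketch below. It is not the package's source: `extractWithOptions` and `defaultWordsSketch` are hypothetical names, and compiling string patterns with the `g`, `i` and `u` flags is an assumption made so that the custom regex example above behaves as documented.

```javascript
// Hypothetical re-implementation for illustration only.
const defaultWordsSketch = /\p{L}+/gu

function extractWithOptions (text, options = {}) {
  // Missing options fall back to the documented defaults.
  const { regex = defaultWordsSketch, toLowercase = false } = options
  let re
  if (typeof regex === 'string') {
    // Assumption: pattern strings are compiled as global, case-insensitive, Unicode.
    re = new RegExp(regex, 'giu')
  } else {
    // Make sure a RegExp is global so match() returns every token.
    const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g'
    re = new RegExp(regex.source, flags)
  }
  const tokens = text.match(re) || []
  return toLowercase ? tokens.map(token => token.toLowerCase()) : tokens
}

console.log(extractWithOptions('A 1000000 dollars baby!', { toLowercase: true }))
// → [ 'a', 'dollars', 'baby' ]
console.log(extractWithOptions("This happens at 5 o'clock !!!", { regex: "[a-z'0-9]+" }))
// → [ 'This', 'happens', 'at', '5', "o'clock" ]
```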

Predefined regexes

wnn.words              // only words, any language <-- default
wnn.numbers            // only numbers
wnn.emojis             // only emojis
wnn.wordsNumbers       // words (any language) and numbers
wnn.wordsEmojis        // words (any language) and emojis
wnn.numbersEmojis      // numbers and emojis
wnn.wordsNumbersEmojis // words (any language), numbers and emojis
wnn.tags               // #tags (any language)
wnn.usernames          // @usernames (any language)
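
The #tags and @usernames examples earlier skip `two#tickets` and `bob@bob.com`, which suggests that the marker must not follow a letter or digit. The sketch below reproduces that behavior with a negative lookbehind. The patterns `tagsSketch` and `usernamesSketch` are hypothetical and are not claimed to match the regexes behind `wnn.tags` and `wnn.usernames`.

```javascript
// Hypothetical patterns for illustration, not the package's own regexes.
// (?<![\p{L}\p{N}]) rejects a marker that directly follows a letter or digit.
const tagsSketch = /(?<![\p{L}\p{N}])#[\p{L}\p{N}]+/gu
const usernamesSketch = /(?<![\p{L}\p{N}])@[\p{L}\p{N}]+/gu

const text = 'A #49ticket to #大阪 or two#tickets costs bob@bob.com, @alice and @美林 ¥2000'

console.log(text.match(tagsSketch) || [])
// → [ '#49ticket', '#大阪' ]
console.log(text.match(usernamesSketch) || [])
// → [ '@alice', '@美林' ]
```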

Languages supported

Supports most languages supported by stopword, and others too. Some languages, such as Japanese and Simplified Chinese, need to be tokenized first. Tokenizers may be added at a later stage.
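
Until tokenizers are added, one possible way to pre-tokenize such text is the standard `Intl.Segmenter` API available in recent Node.js versions and browsers. This is a sketch of a workaround, not something the package does; the helper name `segmentWords` is hypothetical.

```javascript
// Hypothetical helper, independent of words-n-numbers.
// Intl.Segmenter splits text on locale-aware word boundaries.
function segmentWords (text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  return Array.from(segmenter.segment(text))
    .filter(part => part.isWordLike) // drop whitespace and punctuation segments
    .map(part => part.segment)
}

console.log(segmentWords('hello world', 'en'))
// → [ 'hello', 'world' ]

// For Japanese or Chinese the exact split depends on the ICU data in the runtime.
console.log(segmentWords('大阪に行きます', 'ja'))
```

The resulting segments could then be joined with spaces and passed to wnn.extract like any other string.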

PRs welcome

PRs and issues are more than welcome =)

Install

npm i words-n-numbers

License

MIT