
happynodetokenizer

Version 2.1.3

HappyNodeTokenizer

A basic Twitter aware tokenizer.

A JavaScript port of HappyFunTokenizer.py by Christopher Potts and HappierFunTokenizing.py by H. Andrew Schwartz.

Install

  npm install --save happynodetokenizer

Usage

HappyNodeTokenizer exposes an asynchronous function called tokenize() and a synchronous function called tokenizeSync(). tokenizeSync() can also take a callback function as its third argument. The en-GB spellings, tokenise() and tokeniseSync(), are available as well.

Async/Await

const tokenizer = require('happynodetokenizer');
const text = 'A big long string of text...';
const opts = {
  'logs': 2,
  'mode': 'stanford',
  'normalize': false,
  'preserveCase': false,
  'strict': false,
  'tag': false,
};
const getTokens = async (input) => {
  const tokens = await tokenizer.tokenize(input, opts);
  console.log(tokens);
};
// Call the async function; without this line nothing is tokenized.
getTokens(text);

If no tokens are found and opts.strict = false, happynodetokenizer will return an empty array.

Async .then().catch()

const tokenizer = require('happynodetokenizer');
const text = 'A big long string of text...';
const opts = {
  'logs': 2,
  'mode': 'stanford',
  'normalize': false,
  'preserveCase': false,
  'strict': false,
  'tag': false,
};
tokenizer.tokenize(text, opts)
  .then((tokens) => {
    console.log(tokens);
  })
  .catch((err) => {
    // Report the failure; rethrowing here would cause an unhandled rejection.
    console.error(err);
  });

Callback

const tokenizer = require('happynodetokenizer');
const text = 'A big long string of text...';
const opts = {
  'logs': 2,
  'mode': 'stanford',
  'normalize': false,
  'preserveCase': false,
  'strict': false,
  'tag': false,
};
tokenizer.tokenizeSync(text, opts, (err, tokens) => {
  if (err) throw new Error(err);
  console.log(tokens);
});

Sync

const tokenizer = require('happynodetokenizer');
const text = 'A big long string of text...';
const opts = {
  'logs': 2,
  'mode': 'stanford',
  'normalize': false,
  'preserveCase': false,
  'strict': false,
  'tag': false,
};
const tokens = tokenizer.tokenizeSync(text, opts);
console.log(tokens);

Output Examples

Default (opts.tag = false)

Input = "RT @ #happyfuncoding: this is a typical Twitter tweet :-)"

['rt', '@', '#happyfuncoding', ':', 'this', 'is', 'a', 'typical', 'twitter', 'tweet', ':-)']

opts.tag = true

Input = "RT @ #happyfuncoding: this is a typical Twitter tweet :-)"

[
  { value: 'rt', tag: 'word' },
  { value: '@', tag: 'punct' },
  { value: '#happyfuncoding', tag: 'hashtag' },
  { value: ':', tag: 'punct' },
  { value: 'this', tag: 'word' },
  { value: 'is', tag: 'word' },
  { value: 'a', tag: 'word' },
  { value: 'typical', tag: 'word' },
  { value: 'twitter', tag: 'word' },
  { value: 'tweet', tag: 'word' },
  { value: ':-)', tag: 'emoticon' }
]

See the Tags section below for more detail.
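Once tokens are tagged, picking out one kind of token is a simple filter. The sketch below reuses the tagged output shown above as plain data; valuesWithTag is a hypothetical helper written for this example and is not part of happynodetokenizer.

```javascript
// Tagged output copied from the example above.
const tagged = [
  { value: 'rt', tag: 'word' },
  { value: '@', tag: 'punct' },
  { value: '#happyfuncoding', tag: 'hashtag' },
  { value: ':', tag: 'punct' },
  { value: 'this', tag: 'word' },
  { value: 'is', tag: 'word' },
  { value: 'a', tag: 'word' },
  { value: 'typical', tag: 'word' },
  { value: 'twitter', tag: 'word' },
  { value: 'tweet', tag: 'word' },
  { value: ':-)', tag: 'emoticon' },
];

// Hypothetical helper: collect the values of all tokens carrying a given tag.
const valuesWithTag = (tokens, tag) =>
  tokens.filter((t) => t.tag === tag).map((t) => t.value);

console.log(valuesWithTag(tagged, 'hashtag'));  // [ '#happyfuncoding' ]
console.log(valuesWithTag(tagged, 'emoticon')); // [ ':-)' ]
```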

The Options Object

The options object and its properties are optional. The defaults are:

{
  'logs': 2,
  'mode': 'stanford',
  'normalize': false,
  'preserveCase': false,
  'strict': false,
  'tag': false,
};
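Because every property is optional, you only need to pass the ones you want to change. How the library merges options internally is not shown in this README; the sketch below only illustrates the idea with a hypothetical resolveOptions helper and the defaults listed above.

```javascript
// Defaults as listed in this README.
const DEFAULTS = {
  logs: 2,
  mode: 'stanford',
  normalize: false,
  preserveCase: false,
  strict: false,
  tag: false,
};

// Hypothetical helper: caller-supplied properties override the defaults.
const resolveOptions = (opts = {}) => ({ ...DEFAULTS, ...opts });

const resolved = resolveOptions({ mode: 'dlatk', tag: true });
console.log(resolved.mode); // dlatk
console.log(resolved.logs); // 2
```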

logs

number - valid options: 0, 1, 2 (default), 3

Used to control console.log, console.warn, and console.error outputs.

  • 0 = suppress all logs
  • 1 = print errors only
  • 2 = print errors and warnings
  • 3 = print all console logs
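The levels above are cumulative: each level prints everything the previous one does, plus one more kind of message. A minimal sketch of such a gate, assuming a hypothetical makeLogger factory (this is not the library's own logging code):

```javascript
// Hypothetical logger factory; the level numbers follow the list above.
const makeLogger = (level, sink = console) => ({
  error: (...args) => { if (level >= 1) sink.error(...args); },
  warn: (...args) => { if (level >= 2) sink.warn(...args); },
  log: (...args) => { if (level >= 3) sink.log(...args); },
});

// Record messages in an array instead of printing, to show what gets through.
const seen = [];
const recorder = {
  error: (msg) => seen.push(`error: ${msg}`),
  warn: (msg) => seen.push(`warn: ${msg}`),
  log: (msg) => seen.push(`log: ${msg}`),
};

const logger = makeLogger(2, recorder); // 2 = errors and warnings
logger.error('bad input');
logger.warn('empty string');
logger.log('tokenizing...'); // dropped at level 2

console.log(seen); // [ 'error: bad input', 'warn: empty string' ]
```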

mode

string - valid options: stanford (default), or dlatk

stanford mode uses the original HappyFunTokenizer pattern. See GitHub.

dlatk mode uses the modified HappierFunTokenizing pattern. See GitHub.

normalize

boolean - valid options: true, or false (default)

Normalize strings. E.g. when set to true, mañana becomes manana.
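The README does not say how normalization is implemented. One common way to get from mañana to manana in JavaScript is to decompose the string with Unicode NFD and drop the combining marks; the stripDiacritics helper below is an assumption made for illustration only.

```javascript
// Assumption: accents are removed by decomposing each character (NFD)
// and then deleting the combining diacritical marks (U+0300 to U+036F).
const stripDiacritics = (str) =>
  str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

console.log(stripDiacritics('mañana')); // manana
```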

preserveCase

boolean - valid options: true, or false (default)

Preserves the case of the input string. Does not affect emoticons.
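To make the emoticon exception concrete: with preserveCase set to false, word tokens are lowercased while an emoticon such as :D keeps its capital letter. The sketch below mimics that behaviour on tagged tokens; applyCase is a hypothetical helper, not the library's code.

```javascript
// Hypothetical post-processing step: lowercase every token except emoticons.
const applyCase = (tokens, preserveCase) =>
  preserveCase
    ? tokens
    : tokens.map((t) =>
        t.tag === 'emoticon' ? t : { ...t, value: t.value.toLowerCase() });

const sample = [
  { value: 'Twitter', tag: 'word' },
  { value: ':D', tag: 'emoticon' },
];

console.log(applyCase(sample, false).map((t) => t.value)); // [ 'twitter', ':D' ]
console.log(applyCase(sample, true).map((t) => t.value));  // [ 'Twitter', ':D' ]
```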

strict

boolean - valid options: true, or false (default)

When strict is set to true, functions throw errors even for minor problems, which is useful for debugging.

  • false = functions fail gracefully
  • true = functions throw lots of errors
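The difference between the two settings can be sketched as follows. Both checkInput and safeTokens are hypothetical, and the whitespace split is only a stand-in for real tokenizing; the exact conditions under which the library throws are not listed in this README.

```javascript
// Hypothetical input check showing the two failure styles.
const checkInput = (text, strict) => {
  if (typeof text === 'string' && text.length > 0) return true;
  if (strict) throw new TypeError('Expected a non-empty string.');
  return false; // non-strict: let the caller fall back to an empty result
};

// Stand-in tokenizer: splits on whitespace, returns [] on bad input.
const safeTokens = (text, strict = false) =>
  checkInput(text, strict) ? text.split(/\s+/).filter(Boolean) : [];

console.log(safeTokens(''));            // []
console.log(safeTokens('hello world')); // [ 'hello', 'world' ]

try {
  safeTokens('', true);
} catch (err) {
  console.log(err.message); // Expected a non-empty string.
}
```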

tag

boolean - valid options: true, or false (default)

Return an array of tagged token objects instead of a plain array of token strings.

  • true = return an array of token objects: [ {value: 'token', tag: 'word' }, ... ]
  • false = return an array of tokens: [ 'token', 'another', 'word', ... ]
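The two output shapes carry the same token values, so tagged output can be reduced to the plain form when the tags are no longer needed. A short sketch on hand-written data in the tagged shape described above (toPlain is a hypothetical helper):

```javascript
// Tagged tokens in the shape returned when opts.tag is true.
const taggedTokens = [
  { value: 'rt', tag: 'word' },
  { value: '@', tag: 'punct' },
  { value: '#happyfuncoding', tag: 'hashtag' },
];

// Hypothetical helper: keep only the token strings.
const toPlain = (tokens) => tokens.map((t) => t.value);

console.log(toPlain(taggedTokens)); // [ 'rt', '@', '#happyfuncoding' ]
```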

Tags

When opts.tag === true, HappyNodeTokenizer will output an array of token objects. Each token object has two properties: value and tag. The value is the token itself; the tag is one of the descriptors below. Which tags are available depends on the opts.mode you are using:

Tag Stanford DLATK Example
phone ✔️ ✔️ +1 (800) 123-4567
url ✔️ http://www.youtube.com
url_scheme ✔️ http://
url_authority ✔️ [0-3]
url_path_query ✔️ /index.html?s=search
htmltag ✔️ <em class='grumpy'>
emoticon ✔️ ✔️ >:(
username ✔️ ✔️ @phughesmcr
hashtag ✔️ ✔️ #tokenizing
punct ✔️ ✔️ ,
word ✔️ ✔️ hello
<UNK> ✔️ ✔️ (anything left unmatched)

Testing

To compare the results of HappyNodeTokenizer against HappyFunTokenizer and HappierFunTokenizing, run:

npm run test

The goal of this project is to provide a Node.js port of HappyFunTokenizer and HappierFunTokenizing. Therefore, any pull requests with test failures will not be accepted.

License

(C) 2017-20 P. Hughes. All rights reserved.

Shared under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license.
