Jargon is a TypeScript/JavaScript library for tokenization and lemmatization. It finds variations on canonical terms and converts them to a single form.

For example, in tech, you might see 'node js' or 'NodeJS' or 'node.js' and want them understood as the same term. That’s lemmatization.

Quick start

npm install "@clipperhouse/jargon@latest"

Then create a file, preferably TypeScript.

// demo.ts
import jargon from '@clipperhouse/jargon';
import stackexchange from '@clipperhouse/jargon/stackexchange'; // a dictionary
const text = 'I ❤️ Ruby on Rails and vue';
const lemmas = jargon.Lemmatize(text, stackexchange);
// I ❤️ ruby-on-rails and vue.js

Or, in plain JavaScript:

// demo.js
const jargon = require('@clipperhouse/jargon');
const stackexchange = require('@clipperhouse/jargon/stackexchange');
const text = 'I ❤️ Ruby on Rails and vue';
const lemmas = jargon.Lemmatize(text, stackexchange);
// I ❤️ ruby-on-rails and vue.js

What’s it doing?

jargon tokenizes the incoming text, identifying punctuation and spaces. It treats tech-ish terms such as asp.net and TCP/IP, as well as #hashtags and @handles, as single words (many other tokenizers would see two words in each).
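As a rough illustration of that idea, here is a minimal regex-based sketch. It is not jargon's actual tokenizer: the `TOKEN` pattern and the `tokenize` function are hypothetical names invented for this example, and the pattern only handles ASCII letters and digits.

```typescript
// Hypothetical sketch, not jargon's real tokenizer.
// A word may start with # or @, and may contain '.', '/', '+' or '-'
// between alphanumeric runs. Whitespace runs and single punctuation
// characters become their own tokens.
const TOKEN = /[#@]?[A-Za-z0-9]+(?:[.\/+-][A-Za-z0-9]+)*|\s+|[^\sA-Za-z0-9]/g;

function tokenize(text: string): string[] {
  return text.match(TOKEN) || [];
}

// Drop whitespace tokens so the result is easier to read.
const visibleTokens = tokenize('asp.net and TCP/IP, #hashtags @handles.')
  .filter((t) => t.trim().length > 0);
// visibleTokens: ['asp.net', 'and', 'TCP/IP', ',', '#hashtags', '@handles', '.']
```

Note that the trailing comma and period are split off as punctuation, while the dot inside asp.net and the slash inside TCP/IP stay within the word.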

Those tokens go to the lemmatizer, along with a dictionary. The lemmatizer passes over the tokens and asks the dictionary if it recognizes them. It handles multi-token phrases like 'Ruby on Rails', converting them to a single ruby-on-rails token.
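The pass described above can be sketched as a greedy longest-match lookup. This is an assumption about the general technique, not jargon's actual implementation: the `phrases` map, `maxPhraseWords` and `lemmatizeWords` are hypothetical names, and the input is simplified to a list of words with no punctuation or space tokens.

```typescript
// Hypothetical sketch, not jargon's real lemmatizer or dictionary format.
// Keys are lower-cased phrases joined by single spaces; values are canonical forms.
const phrases = new Map<string, string>([
  ['ruby on rails', 'ruby-on-rails'],
  ['vue', 'vue.js'],
]);
const maxPhraseWords = 3; // length of the longest phrase in this toy dictionary

function lemmatizeWords(words: string[]): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < words.length) {
    let matched = false;
    // Try the longest run first, so a full phrase wins over a shorter match.
    for (let n = Math.min(maxPhraseWords, words.length - i); n >= 1; n--) {
      const key = words.slice(i, i + n).join(' ').toLowerCase();
      const canonical = phrases.get(key);
      if (canonical !== undefined) {
        out.push(canonical);
        i += n; // skip past every word the phrase consumed
        matched = true;
        break;
      }
    }
    if (!matched) {
      out.push(words[i]); // unknown words pass through unchanged
      i += 1;
    }
  }
  return out;
}

const lemmatized = lemmatizeWords('I love Ruby on Rails and vue'.split(' ')).join(' ');
// lemmatized: 'I love ruby-on-rails and vue.js'
```

Trying the longest run first is a design choice: it keeps 'Ruby on Rails' from being matched as a shorter entry such as a lone 'ruby', if the dictionary had one.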

Matching is insensitive to spaces, hyphens, dots, slashes and case, so it handles a lot of variation that would be difficult to get right with simple search-and-replace or regex.
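One way to picture that insensitivity is a normalized lookup key. The sketch below is an assumption for illustration only; `normalizeKey` is a hypothetical helper, and jargon's real normalization rules may differ.

```typescript
// Hypothetical sketch: jargon's real normalization may differ.
// Remove spaces, hyphens, dots and slashes, then lower-case,
// so that spelling variants collapse to the same lookup key.
function normalizeKey(term: string): string {
  return term.replace(/[\s.\/-]/g, '').toLowerCase();
}

const variants = ['node js', 'NodeJS', 'node.js', 'Node-JS'];
const keys = variants.map(normalizeKey);
// every entry in keys is 'nodejs'
```

A dictionary keyed on such a normalized form needs one entry per term rather than one per spelling.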

These rules are defined in a Dictionary. In the above examples, stackexchange is the dictionary, and it knows, for example, that react and react.js refer to the same term. It also understands synonyms, such as ecmascript ↔ javascript.

Another example is the contractions dictionary. It splits a token like "it'll" into two tokens, "it" and "will".
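A one-to-many mapping of this kind could look like the following sketch. It does not reflect the contents or shape of jargon's contractions dictionary; the `contractions` map and `expandContractions` function are hypothetical names used only for illustration.

```typescript
// Hypothetical sketch, not jargon's real contractions dictionary.
// Each known contraction maps to the list of tokens that replace it.
const contractions = new Map<string, string[]>([
  ["it'll", ['it', 'will']],
  ["can't", ['can', 'not']],
]);

function expandContractions(tokens: string[]): string[] {
  const out: string[] = [];
  for (const token of tokens) {
    const parts = contractions.get(token.toLowerCase());
    if (parts) {
      out.push(...parts); // one token in, several tokens out
    } else {
      out.push(token); // anything else passes through unchanged
    }
  }
  return out;
}

const expanded = expandContractions(["it'll", 'work']);
// expanded: ['it', 'will', 'work']
```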



