For example, in tech, you might see 'node js' or 'NodeJS' or 'node.js' and want them understood as the same term. That’s lemmatization.
npm install "@clipperhouse/jargon@latest"
Then create a file, preferably TypeScript:
```ts
// demo.ts
import * as jargon from '@clipperhouse/jargon';
import * as stackexchange from '@clipperhouse/jargon/stackexchange'; // a dictionary

const text = 'I ❤️ Ruby on Rails and vue';
const lemmas = jargon.tokenize(text).lemmatize(stackexchange.dictionary);
console.log(lemmas.toString());
// I ❤️ ruby-on-rails and vue.js
```
Or plain JavaScript:

```js
// demo.js
const jargon = require('@clipperhouse/jargon');
const stackexchange = require('@clipperhouse/jargon/stackexchange'); // a dictionary

const text = 'I ❤️ Ruby on Rails and vue';
const lemmas = jargon.tokenize(text).lemmatize(stackexchange.dictionary);
console.log(lemmas.toString());
// I ❤️ ruby-on-rails and vue.js
```
What’s it doing?
jargon tokenizes the incoming text, identifying punctuation and spaces. It understands tech-ish terms as single words, such as asp.net and TCP/IP, as well as #hashtags and @handles, where other tokenizers would see two words.
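To watch the tokenizer by itself, you can iterate its tokens. A sketch, assuming `tokenize` returns an iterable of tokens with a `toString` method, consistent with the demo above:

```ts
import * as jargon from '@clipperhouse/jargon';

// Assumption: tokenize() returns an iterable of tokens, each with toString()
for (const token of jargon.tokenize('I use asp.net, TCP/IP and #hashtags')) {
  console.log(token.toString());
}
// asp.net, TCP/IP and #hashtags each come through as a single token
```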
Those tokens go to the lemmatizer, along with a dictionary. The lemmatizer passes over the tokens and asks the dictionary whether it recognizes them. It handles multi-token phrases like 'Ruby on Rails', converting them to a single token, ruby-on-rails.
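Conceptually, that pass looks something like the following. This is a minimal sketch of the technique, not the library's internals; it assumes a dictionary that can look up phrases of up to some maximum length:

```ts
// Minimal sketch of the idea, not the library's internals.
// A dictionary can look up a phrase of up to maxGram tokens.
interface Dictionary {
  maxGram: number;
  lookup(tokens: string[]): string | null; // canonical form, or null
}

function lemmatize(tokens: string[], dict: Dictionary): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    let matched = false;
    // Try the longest candidate phrase first, then shrink the window
    for (let n = Math.min(dict.maxGram, tokens.length - i); n >= 1; n--) {
      const lemma = dict.lookup(tokens.slice(i, i + n));
      if (lemma !== null) {
        out.push(lemma); // e.g. ['Ruby', 'on', 'Rails'] becomes 'ruby-on-rails'
        i += n;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out.push(tokens[i]); // unrecognized tokens pass through unchanged
      i++;
    }
  }
  return out;
}
```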
This matching is insensitive to spaces, hyphens, dots, slashes, and case, so it handles a lot of variation that would be difficult to get right with simple search-and-replace or regex.
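One way to picture that insensitivity, as an illustration rather than the library's actual lookup: collapse each candidate phrase into a normalized key before consulting the dictionary.

```ts
// Illustration only: collapse variants into one lookup key by
// dropping spaces, hyphens, dots, slashes, and case.
function key(phrase: string): string {
  return phrase.replace(/[\s\-./]/g, '').toLowerCase();
}

key('node js'); // "nodejs"
key('Node.js'); // "nodejs"
key('NODE-JS'); // "nodejs"
// All three hit the same dictionary entry.
```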
These rules are defined in a Dictionary. In the above examples, it's the stackexchange dictionary.
Another example is the contractions dictionary. It'll split tokens like it'll into two tokens, it and will.
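Swapping in that dictionary should look just like the demo above. A sketch; the contractions import path here is my assumption, mirroring the stackexchange one:

```ts
import * as jargon from '@clipperhouse/jargon';
// Assumption: the contractions dictionary ships under this subpath
import * as contractions from '@clipperhouse/jargon/contractions';

const lemmas = jargon.tokenize("It'll be great").lemmatize(contractions.dictionary);
console.log(lemmas.toString());
// It will be great
```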