The task is simple: take a String as input, and return an Array of every word boundary.
This implements the Unicode 8.0 Text Segmentation Algorithm. That makes it suitable for English and other European languages, but it works poorly for Chinese, Japanese, and other languages that do not put separator characters between words.
You may need to install a prerequisite:
apt-get install libicu-dev or
dnf install libicu-devel. (Node itself depends on ICU; you just need the
(On Mac: try
brew install icu4c && brew link icu4c --force)
Add it to your project:
npm install --save node-word-boundaries
Then use it:
    const word_boundaries = require('word-boundaries')

    const text = 'See Jack run.' // f
    const boundaries = word_boundaries.find_word_boundaries(text)
    console.log(boundaries) // 0, 3, 4, 8, 9, 12, 13

    const parts = word_boundaries.split(text)
    console.log(parts) // 'See', ' ', 'Jack', ' ', 'run', '.'
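The `split` output shown above includes whitespace and punctuation tokens. If you only want the words, a small filter is enough. This is a minimal sketch: `words_only` is a hypothetical helper, not part of this library, and "contains at least one letter or digit" is an assumed definition of "word" that may not suit every use.

```javascript
// Hypothetical helper (not part of this library): keep only the tokens
// that contain at least one Unicode letter or digit.
function words_only(parts) {
  return parts.filter(part => /[\p{L}\p{N}]/u.test(part))
}

// Token list copied from the split() example above.
const parts = ['See', ' ', 'Jack', ' ', 'run', '.']
console.log(words_only(parts)) // [ 'See', 'Jack', 'run' ]
```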
Boundary indices are pretty standard in C-like languages. As a refresher: they point to the spaces between characters in a String. Visually:
     S e e   J a c k   r u n .
    ^ - - ^ ^ - - - ^ ^ - - ^ ^
    0   2   4   6   8   10  12
      1   3   5   7   9   11  13
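To make the index convention concrete: each pair of adjacent boundaries delimits one token, so slicing the String between them recovers the pieces. The sketch below uses a hypothetical helper, `slice_at_boundaries`, that is not part of this library, and it assumes the indices are ordinary JavaScript String offsets (UTF-16 code units).

```javascript
// Hypothetical helper (not part of this library): cut `text` at each
// pair of adjacent boundary indices.
// Assumption: indices are JavaScript String offsets (UTF-16 code units).
function slice_at_boundaries(text, boundaries) {
  const parts = []
  for (let i = 0; i + 1 < boundaries.length; i++) {
    parts.push(text.slice(boundaries[i], boundaries[i + 1]))
  }
  return parts
}

const text = 'See Jack run.'
const boundaries = [0, 3, 4, 8, 9, 12, 13] // values from the example above
console.log(slice_at_boundaries(text, boundaries))
// [ 'See', ' ', 'Jack', ' ', 'run', '.' ]
```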
The input must be valid Unicode. In particular, a string in which a low surrogate
is followed by a high surrogate is invalid; that will cause
undefined behavior. (This constraint is true of most programs that deal with
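If your input comes from an untrusted source, you may want to check it before passing it in. The sketch below is one way to do that: `is_well_formed_utf16` is a hypothetical helper, not part of this library, and it simply rejects any String that contains an unpaired surrogate.

```javascript
// Hypothetical helper (not part of this library): return true only if
// every surrogate in `text` is part of a valid high+low pair.
function is_well_formed_utf16(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i)
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = text.charCodeAt(i + 1) // NaN at the end of the String
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false
      i++ // skip the low surrogate we just checked
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no high surrogate before it.
      return false
    }
  }
  return true
}

console.log(is_well_formed_utf16('See Jack run.')) // true
console.log(is_well_formed_utf16('\udc00\ud800')) // false
```

Recent JavaScript engines also provide a built-in `String.prototype.isWellFormed()`; the loop above is for runtimes that lack it.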
- node-icu-tokenizer: returns tokens, not boundaries. Also, returns a much larger data structure.
- node-icu-wordsplit: returns tokens, not boundaries. Also takes a Locale argument: the tr29 algorithm is locale-independent, but ICU isn't.
- overview-js-tokenizer: returns tokens. This project is a fork.
mocha -w in the background as you implement features. Write tests in
Pull requests are welcome! In particular, this library could use:
- More unit tests
- Options, especially those suggested at http://www.unicode.org/reports/tr29
AGPL-3.0. This project is (c) Overview Services Inc. and Adam Hooper. Please contact both should you desire a more permissive license.