
    Word Boundaries

    The task is simple: take a String as input, and return an Array of every word boundary.

    This implements the Unicode 8.0 Text Segmentation Algorithm (UAX #29). That makes it suitable for English and other European languages, but it works poorly for Chinese, Japanese, and other languages that do not put any characters between words.


    You may need to install a prerequisite: apt-get install libicu-dev or dnf install libicu-devel. (Node itself depends on ICU; you just need the development headers.)

    (On Mac: try brew install icu4c && brew link icu4c --force)

    Add it to your project: npm install --save node-word-boundaries

    Then use it:

    const word_boundaries = require('word-boundaries')
    const text = 'See Jack run.'
    // Find the index of every word boundary in the text
    const boundaries = word_boundaries.find_word_boundaries(text)
    console.log(boundaries) // 0, 3, 4, 8, 9, 12, 13
    // Split the text at those boundaries
    const parts = word_boundaries.split(text)
    console.log(parts) // 'See', ' ', 'Jack', ' ', 'run', '.'

    Boundary indices are pretty standard in C-like languages. As a refresher: they point to the gaps between characters in a String, not to the characters themselves. Visually:

     S e e   J a c k   r u n .
    ^ - - ^ ^ - - - ^ ^ - - ^ ^
    0   2   4   6   8  10  12
      1   3   5   7   9  11  13
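
    To see how the boundary indices relate to the split output, here is a minimal sketch that slices the text between each pair of consecutive boundaries. Note that slice_at_boundaries is a hypothetical helper written for illustration, not part of this library, and the boundaries array is hard-coded from the example above rather than computed.

```javascript
// Hypothetical helper (not part of this library): cut `text` between
// each pair of consecutive boundary indices.
function slice_at_boundaries(text, boundaries) {
  const parts = []
  for (let i = 0; i + 1 < boundaries.length; i++) {
    parts.push(text.slice(boundaries[i], boundaries[i + 1]))
  }
  return parts
}

const text = 'See Jack run.'
// Hard-coded from the example above, not computed by the library.
const boundaries = [0, 3, 4, 8, 9, 12, 13]
console.log(slice_at_boundaries(text, boundaries))
// [ 'See', ' ', 'Jack', ' ', 'run', '.' ]
```

    Each boundary serves as the end of one part and the start of the next, which is why seven boundaries yield six parts.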


    The input must be valid Unicode. In particular, a string like \uDC00\uD800 is invalid (it's a low surrogate followed by a high surrogate) and will cause undefined behavior. (This constraint applies to most programs that deal with Strings.)
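
    If you cannot trust your input, you could check it before passing it in. The following is a minimal sketch, assuming that "valid" here means the String contains no unpaired UTF-16 surrogates. Note that is_well_formed is a hypothetical helper written for illustration, not part of this library.

```javascript
// Hypothetical helper (not part of this library): returns true when
// `text` contains no unpaired UTF-16 surrogates.
function is_well_formed(text) {
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i)
    if (code >= 0xD800 && code <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = text.charCodeAt(i + 1) // NaN when past the end
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return false
      i++ // skip the low surrogate we just validated
    } else if (code >= 0xDC00 && code <= 0xDFFF) {
      // Low surrogate with no high surrogate before it.
      return false
    }
  }
  return true
}

console.log(is_well_formed('See Jack run.')) // true
console.log(is_well_formed('\uD800\uDC00'))  // true (a valid pair)
console.log(is_well_formed('\uDC00\uD800'))  // false (the example above)
```

    Whether to reject such input or repair it first is an application-level choice.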


    • node-icu-tokenizer: returns tokens, not boundaries. Also, returns a much larger data structure.
    • node-icu-wordsplit: returns tokens, not boundaries. Also takes a Locale argument: tr29 itself is locale-independent, but ICU isn't.
    • overview-js-tokenizer: returns tokens. This project is a fork.


    Download the source and run npm install.

    Run mocha -w in the background as you implement features. Write tests in test/.


    Pull requests are welcome! In particular, this library could use:


    AGPL-3.0. This project is (c) Overview Services Inc. and Adam Hooper. Please contact both should you desire a more permissive license.


