A document processing pipeline mostly used with search-index
DocProc is a pumpify chain of
transform streams that turns Plain Old JSON Objects into a format that
can be indexed by search-index.
Each processed document must have the following fields:
id - the document id
vector - the vector used for ranking
stored - the document that will be cached
raw - the unadulterated document
normalised - the "cleaned up" document
tokenised - the tokenised document
For example, the document

    {
      id: 'one',
      text: 'the first doc'
    }

becomes

    {
      id: 'one',
      normalised: { id: 'one', text: 'the first doc' },
      raw: { id: 'one', text: 'the first doc' },
      stored: { id: 'one', text: 'the first doc' },
      tokenised: { id: [ 'one' ], text: [ 'the', 'first', 'doc' ] },
      vector: {
        id: { one: 1, '*': 1 },
        text: { doc: 1, first: 1, the: 1, '*': 1 },
        '*': { one: 1, '*': 1, doc: 1, first: 1, the: 1 }
      }
    }

after being passed through docProc.
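The required fields listed above can be checked with a small helper. This is only an illustrative sketch: the name `missingFields` is hypothetical and not part of the library, and the check looks only at whether each field is present, not at its contents.

```javascript
// Field names taken from the list of required fields above.
const REQUIRED_FIELDS = ['id', 'vector', 'stored', 'raw', 'normalised', 'tokenised']

// Hypothetical helper: returns the names of any required fields
// that are missing from a processed document.
function missingFields (doc) {
  return REQUIRED_FIELDS.filter(field => !(field in doc))
}

// A document that has only been partially processed
const incomplete = { id: 'one', raw: { id: 'one', text: 'the first doc' } }
console.log(missingFields(incomplete))
```

A helper like this can be handy at the end of a custom pipeline, to confirm that no stage was left out.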
You can also compose document processing pipelines by reusing the stages provided, or by creating new ones using the Node.js transform stream specification.
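As a rough sketch of what a new stage might look like, the following builds an object-mode transform stream with Node's built-in `stream.Transform`. The stage itself (`trimStage`, which trims whitespace from the fields of `normalised`) is invented for illustration and is not one of the stages shipped with the library.

```javascript
const { Transform } = require('stream')

// Hypothetical stage logic: trim whitespace from every string field of
// doc.normalised. Kept as a plain function so it can be tested on its own.
function trimNormalised (doc) {
  const normalised = {}
  for (const [field, value] of Object.entries(doc.normalised || {})) {
    normalised[field] = typeof value === 'string' ? value.trim() : value
  }
  return Object.assign({}, doc, { normalised })
}

// Wraps the function above in an object-mode transform stream, so that
// whole documents (not buffers) flow through the stage.
function trimStage () {
  return new Transform({
    objectMode: true,
    transform (doc, encoding, callback) {
      callback(null, trimNormalised(doc))
    }
  })
}

const stage = trimStage()
stage.on('data', doc => console.log(doc.normalised))
stage.write({ id: 'one', normalised: { id: 'one', text: '  the first doc  ' } })
stage.end()
```

Keeping the per-document logic in a plain function and wrapping it in a stream afterwards makes each stage easy to unit test without any stream plumbing.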
A function that returns a writable stream containing a sensible default document processing pipeline.
A function that takes in an Array of pipeline stages where every stage is a transform stream and returns a writable stream.
A transform stream that calculates term frequency.
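To illustrate the idea, here is a minimal sketch that counts raw token occurrences for one field. It assumes that term frequency is a plain count and that every vector carries a `'*': 1` wildcard entry, as the example vector earlier in this document suggests; the library's actual weighting may differ. The function name `termFrequency` is hypothetical.

```javascript
// Hypothetical helper: builds a { token: count } vector for one field.
function termFrequency (tokens) {
  const counts = new Map()
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1)
  }
  // Assumption: the wildcard entry is always 1, as in the example vector
  counts.set('*', 1)
  return Object.fromEntries(counts)
}

console.log(termFrequency(['the', 'first', 'doc']))
```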
A transform stream that calculates the composite vector, used for searching across all fields.
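One way to picture the composite vector is as a term count over the tokens of every field taken together, stored under the `'*'` key. The sketch below makes that assumption, which matches the example vector earlier in this document but is not confirmed as the library's exact method; `compositeVector` is a hypothetical name.

```javascript
// Hypothetical helper: pools the tokens of all fields and counts them.
function compositeVector (tokenised) {
  const counts = new Map()
  for (const tokens of Object.values(tokenised)) {
    for (const token of tokens) {
      counts.set(token, (counts.get(token) || 0) + 1)
    }
  }
  // Assumption: the wildcard entry is always 1, as in the example vector
  counts.set('*', 1)
  return Object.fromEntries(counts)
}

const tokenised = { id: ['one'], text: ['the', 'first', 'doc'] }
console.log(compositeVector(tokenised))
```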
A transform stream that creates sort vectors.
A transform stream that defines the parts of each document that are to be cached in the index itself.
A transform stream that determines which fields can be searched on individually. To keep indexes small, you can restrict indexing to only those fields that need to be searched on.
A transform stream that takes an unprocessed document and converts it
into a structure that can be processed by
A transform stream that converts text to lower case.
A transform stream that converts non-string fields into Strings.
A transform stream that removes stopwords.
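Stopword removal can be sketched as a filter over each field's token array. The stopword list and the name `removeStopwords` below are illustrative assumptions, not the library's defaults.

```javascript
// Illustrative stopword list; a real pipeline would use a fuller one.
const STOPWORDS = new Set(['a', 'an', 'and', 'of', 'the'])

// Hypothetical helper: drops stopwords from every field of a
// tokenised document ({ field: [token, ...] }).
function removeStopwords (tokenised, stopwords = STOPWORDS) {
  const result = {}
  for (const [field, tokens] of Object.entries(tokenised)) {
    result[field] = tokens.filter(token => !stopwords.has(token))
  }
  return result
}

console.log(removeStopwords({ id: ['one'], text: ['the', 'first', 'doc'] }))
```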
A transform stream that will do nothing other than print out the state
of the document to
console.log. Use this when developing and
A transform stream that splits fields down into their individual linguistic tokens.
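A simple picture of tokenisation is splitting each field on runs of characters that are neither letters nor digits. The separator and the name `tokenise` in this sketch are assumptions chosen for illustration; the library's own separator may be different.

```javascript
// Hypothetical helper: splits every field of a normalised document
// into an array of tokens.
// Assumption: tokens are separated by anything that is not a letter or digit.
function tokenise (normalised, separator = /[^\p{L}\p{N}]+/u) {
  const tokenised = {}
  for (const [field, value] of Object.entries(normalised)) {
    // filter(Boolean) drops the empty strings left by leading or
    // trailing separators
    tokenised[field] = String(value).split(separator).filter(Boolean)
  }
  return tokenised
}

console.log(tokenise({ id: 'one', text: 'the first doc' }))
```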