# Text-tokenizer

**Repository Path**: mirrors_Orange-OpenSource/Text-tokenizer

## Basic Information

- **Project Name**: Text-tokenizer
- **Description**: C++ library to tokenize text (UTF-8 encoding) into typed tokens. This is a basic functionality for almost all Natural Language Processing (NLP) approaches. The library has a simple API and is initialised with a rule file defining the token types and the regular expressions to match.
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-18
- **Last Updated**: 2026-01-17

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Text Segmenter Library

Library to segment raw text (UTF-8) into typed segments using a set of regular expressions.
An example set of segmentation rules is in `example/seg.data`.
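
To illustrate the general idea of rule-based typed segmentation, here is a minimal sketch. It does not use this library or its rule file format: it uses `std::wregex` instead of `boost_regex`, and the names `Rule`, `Segment`, `segmentText` and `demoRules`, as well as the four demo patterns, are assumptions made for the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical rule: a segment type name plus the pattern that matches it.
struct Rule {
    std::wstring type;
    std::wregex pattern;
};

// A typed segment: the type name and the matched text.
struct Segment {
    std::wstring type;
    std::wstring text;
};

// At each position, try the rules in order and keep the first one that
// matches exactly at that position. If no rule matches, skip one character.
std::vector<Segment> segmentText(const std::wstring &text,
                                 const std::vector<Rule> &rules) {
    std::vector<Segment> out;
    std::wstring::const_iterator pos = text.cbegin();
    while (pos != text.cend()) {
        bool matched = false;
        for (const Rule &r : rules) {
            std::wsmatch m;
            if (std::regex_search(pos, text.cend(), m, r.pattern,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > 0) {
                out.push_back({r.type, m.str(0)});
                pos += m.length(0);
                matched = true;
                break;
            }
        }
        if (!matched) ++pos;
    }
    return out;
}

// Demo rules (illustrative only, not taken from example/seg.data).
std::vector<Rule> demoRules() {
    return {
        {L"NUMBER", std::wregex(L"[0-9]+")},
        {L"WORD", std::wregex(L"[A-Za-z]+")},
        {L"SPACE", std::wregex(L"[ \\t\\n]+")},
        {L"PUNCT", std::wregex(L"[.,;:!?]")},
    };
}
```

Calling `segmentText(L"Hi, 42!", demoRules())` yields five segments: a word, a punctuation mark, a space, a number and a punctuation mark. Rule order matters in this sketch, since the first matching rule wins.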

## License

This library is under the [3-Clause BSD Licence](https://opensource.org/licenses/BSD-3-Clause). See [LICENSE.md](LICENSE.md).


## Author

Johannes Heinecke

## Requirements

Needs `boost_regex` to run, and `boost_regex_dev`, `curl` and `python` to compile.


## Build

The build process downloads http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, which is needed to determine the properties of Unicode characters.
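
For orientation, each line of `UnicodeData.txt` is a semicolon-separated record whose first three fields are the code point (hexadecimal), the character name and the general category (for example `Lu` for an uppercase letter). The following sketch shows how such a line could be read. How this library's build actually processes the file is not shown here, and `UnicodeEntry` and `parseUnicodeDataLine` are hypothetical names.

```cpp
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// One record of UnicodeData.txt, reduced to the fields used in this sketch.
struct UnicodeEntry {
    unsigned long codepoint;  // field 0, hexadecimal
    std::string name;         // field 1
    std::string category;     // field 2, e.g. "Lu", "Nd", "Po"
};

// Split the line on ';' and keep the first three fields.
// Returns false if the line does not look like a valid record.
bool parseUnicodeDataLine(const std::string &line, UnicodeEntry &entry) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ';')) fields.push_back(field);
    if (fields.size() < 3 || fields[0].empty()) return false;
    try {
        entry.codepoint = std::stoul(fields[0], nullptr, 16);
    } catch (const std::exception &) {
        return false;  // code point field is not a hexadecimal number
    }
    entry.name = fields[1];
    entry.category = fields[2];
    return true;
}
```

A segmenter can use the general category to decide, for instance, whether a character is a letter, a digit or punctuation.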

    mkdir build
    cd build
    cmake -DCMAKE_BUILD_TYPE=Release [-DCMAKE_INSTALL_PREFIX=/usr/local]  ..

(You can also use `Debug`, especially for `valgrind_test`.)

    make [-j 4]
    make mini_test
    make valgrind_test
	
## Testing

    build/example/textSegmenter [options] example/seg.data example/text.txt
      options
        --out json|col|long|short   define output format
        --select                    output only segments which are not marked 'ignore'

## Install library

    cd build
    make install

## API

Create an instance of the segmenter:

    string data; // segmentation rules (see example/seg.data)
    Segmenter *seg = new Segmenter(data);

Convert the input text from a UTF-8 string to a wide-character string:

    string inputtext; // input text in UTF-8
    Text result; // instance which contains the segmented text
    bool skip_ignored_segments = false; // if true, do not output segment types which are marked as "ignored" in the rules file

    wchar_t *unicode = Unicode::fromUTF8(inputtext.c_str());

Segment the text:

    seg->segment(wstring(unicode), result, skip_ignored_segments);
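
The segmenter works on wide characters, so UTF-8 input has to be decoded first. As an illustration of what such a conversion involves, here is a minimal stand-alone decoder. It is not the implementation of `Unicode::fromUTF8`; the function name `decodeUtf8` is hypothetical, and the sketch assumes that `wchar_t` is wide enough to hold the decoded code points (32 bits, as on Linux, for characters outside the Basic Multilingual Plane).

```cpp
#include <cstddef>
#include <string>

// Decode a UTF-8 byte string into a wide string, one code point per wchar_t.
// Malformed or truncated sequences are replaced by U+FFFD.
// This sketch does not reject overlong encodings or surrogate code points.
std::wstring decodeUtf8(const std::string &in) {
    const wchar_t replacement = static_cast<wchar_t>(0xFFFD);
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (c < 0x80) { cp = c; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }

        // Every continuation byte must exist and have the form 10xxxxxx.
        bool ok = (i + extra < in.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (cc & 0x3F);
        }
        if (!ok) { out.push_back(replacement); ++i; continue; }

        out.push_back(static_cast<wchar_t>(cp));
        i += extra + 1;
    }
    return out;
}
```

For example, the two bytes `0xC3 0xA9` decode to the single character U+00E9.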


Output the result (several possibilities):

    result.json(cout); // output the result in json to stdout
    result.longOutput(cerr); // output the result in a verbose format to stderr

    // output the result into a file (simple tsv format)
    ofstream outfile("outputfile.txt");
    result.columns(outfile);
    outfile.close();


See [example/textSegmenter.cc](example/textSegmenter.cc) for more information.
Do not forget to add `-I/usr/local/include/segmenter` and `-llibsegment` to your C++ compiler options.