Segmenting and tagging with vidyut.cheda

Warning

This module is incomplete and may be deleted in a future release. We recommend using the Dharmamitra analyzer instead if possible.

vidyut.cheda segments Sanskrit expressions into words, then annotates those words with their morphological data. Our segmenter is optimized for real-time and interactive usage: it is fast, uses little memory, and handles pathological input gracefully.

The main class here is Chedaka, which defines a segmenter. The main return type is Token, which contains the segmented text with its associated Pada data.

Example usage:

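from vidyut.cheda import Chedaka

# Replace this placeholder with the path to your local Vidyut data directory.
chedaka = Chedaka("/path/to/vidyut-data")

# "gacCati" is gacchati in SLP1 transliteration.
# Each Token carries the segmented text and its associated Pada data.
for token in chedaka.run("gacCati"):
    print(token.text, token.data)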
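
When segmenting more than one expression, the same loop can be wrapped in a small helper. The following is only a sketch, not part of vidyut: segment_lines, Segmenter, FakeToken, and WhitespaceSegmenter are hypothetical names introduced for illustration. The helper assumes nothing beyond what the example above shows, namely an object with a run(text) method whose results have a text attribute. The whitespace-splitting stand-in exists only so that the sketch runs without Vidyut data, and it performs no real Sanskrit segmentation.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Protocol


class Segmenter(Protocol):
    """Hypothetical interface: anything with a run(text) method, as Chedaka has above."""

    def run(self, text: str) -> Iterable[Any]: ...


def segment_lines(segmenter: Segmenter, lines: Iterable[str]) -> List[List[str]]:
    """Segment each non-blank line and return the token texts, one list per line."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of passing them to the segmenter
        results.append([token.text for token in segmenter.run(line)])
    return results


@dataclass
class FakeToken:
    """Hypothetical stand-in for Token; only the text attribute is used here."""

    text: str
    data: Any = None


class WhitespaceSegmenter:
    """Hypothetical stand-in for Chedaka that merely splits on whitespace."""

    def run(self, text: str) -> List[FakeToken]:
        return [FakeToken(part) for part in text.split()]


if __name__ == "__main__":
    demo = segment_lines(WhitespaceSegmenter(), ["gacCati", "", "rAmo vanam gacCati"])
    print(demo)
```

In real use, a Chedaka instance constructed as in the example above would be passed in place of WhitespaceSegmenter.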