Automatic Linguistic Text Processing
NLP technology
Linguistic research and applications are part of the UIS RUSSIA project. A suite of linguistic processors has been developed to perform automatic text processing.
The procedures include:
- processing of electronic text in the main formats (ASCII, HTML, MS Word) under Windows, with the processors operating as a DLL;
- morphological analysis of Russian/English texts;
- term recognition and sense disambiguation;
- thematic analysis: event categorization, indexing, annotation/summarization;
- loading of results into an Oracle database server.
The main instrument is the Socio-Political Thesaurus (the Thesaurus). Its current version incorporates 70,000 concepts/descriptors with synonyms, including 6,500 geographic names. The Thesaurus helps identify the main and subordinate topics of a document.
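As a rough illustration of how a thesaurus of descriptors with synonyms can help separate main from subordinate topics, here is a minimal sketch. It is not the UIS RUSSIA implementation: the tiny `SAMPLE_THESAURUS`, the function names, and the `main_share` threshold are all hypothetical, and the matching is limited to single lowercase English words with no morphological analysis or sense disambiguation.

```python
import re
from collections import Counter

# Hypothetical miniature thesaurus: descriptor -> synonyms (lowercase, single words).
SAMPLE_THESAURUS = {
    "ELECTION": ["election", "elections", "vote", "ballot"],
    "TAXATION": ["tax", "taxes", "taxation"],
    "MOSCOW": ["moscow"],  # stands in for a geographic name
}


def build_index(thesaurus):
    """Map every synonym to the descriptor it belongs to."""
    return {syn: concept for concept, syns in thesaurus.items() for syn in syns}


def rank_topics(text, thesaurus, main_share=0.5):
    """Split matched descriptors into main and subordinate topics.

    A descriptor counts as a main topic when it accounts for at least
    `main_share` of all thesaurus matches in the text (an assumed rule).
    """
    index = build_index(thesaurus)
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(index[t] for t in tokens if t in index)
    total = sum(counts.values())
    main, subordinate = [], []
    for concept, n in counts.most_common():
        if n / total >= main_share:
            main.append(concept)
        else:
            subordinate.append(concept)
    return main, subordinate


if __name__ == "__main__":
    sample = ("The election was close: the vote and the ballot count "
              "continued while a new tax was discussed in Moscow.")
    print(rank_topics(sample, SAMPLE_THESAURUS))
```

Counting matches per descriptor rather than per word is what lets synonyms reinforce a single topic; a production system would also need multi-word terms and disambiguation, which this sketch omits.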
The technology allows up to 500 MB of electronic text to be processed and integrated into the University Information System RUSSIA (UIS RUSSIA) daily.
The results of Automatic Linguistic Text Processing (ALTP) are used to support advanced search tools.
Search engine
UIS RUSSIA offers innovative search instruments. In addition to traditional tools, value-added content-based instruments are available, including:
- several classification schemes (systems of subject headings, SSH), including the UIS RUSSIA SSH and the Congressional Research Service (Library of Congress) Legislative Indexing Vocabulary Top Terms;
- Socio-Political Thesaurus with Thesaurus-based query refinement;
- RF GRNTI (Gosudarstvennyi rubrikator nauchno-tekhnicheskoi informatsii, the State Rubricator of Scientific and Technical Information), an SSH for academic publications in economics and social sciences;
- categorization software based on the JEL (Journal of Economic Literature) classification system.
Special informers are implemented to assist in query refinement.
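One simple way to picture Thesaurus-based query refinement is synonym expansion: each query word that matches a thesaurus entry is replaced by the full synonym group of its descriptor. The sketch below is only an assumption about how such refinement could work, not a description of the UIS RUSSIA search engine; the `SAMPLE_THESAURUS` data, the `refine_query` function, and the `AND`/`OR` output syntax are hypothetical.

```python
# Hypothetical miniature thesaurus: descriptor -> synonyms (lowercase, single words).
SAMPLE_THESAURUS = {
    "TAXATION": ["tax", "taxes", "taxation"],
    "ELECTION": ["election", "elections", "vote", "ballot"],
}


def refine_query(query, thesaurus):
    """Expand query words found in the thesaurus into OR-groups of synonyms.

    Words without a thesaurus match are kept unchanged; all parts are
    joined with AND (an assumed query syntax, for illustration only).
    """
    index = {syn: concept for concept, syns in thesaurus.items() for syn in syns}
    parts = []
    for word in query.lower().split():
        concept = index.get(word)
        if concept is None:
            parts.append(word)
        else:
            parts.append("(" + " OR ".join(sorted(thesaurus[concept])) + ")")
    return " AND ".join(parts)


if __name__ == "__main__":
    print(refine_query("tax reform", SAMPLE_THESAURUS))
```

Expanding through the descriptor, rather than through the word itself, means that any synonym the user types leads to the same refined query.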