aligusnet/InformationRetrieval

Information Retrieval

Definitions

List of definitions of key types and concepts.

Corpus

The project defines how a collection of documents is organized into a corpus and blocks.

  • Corpus - collection of text documents organized into blocks;
  • Block - subset of the corpus, small enough to be processed in memory;
  • Document - piece of text with metadata, of which the most important is DocumentId;
  • DocumentId - unique identifier of a document.
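
The relationship between these types can be illustrated with a minimal sketch. It is not the project's actual code: the Python names, the integer `DocumentId`, the `title` metadata field, and the `split_into_blocks` helper with its character-count limit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List

# Assumption: the README does not say what a DocumentId looks like; an int is used here.
DocumentId = int


@dataclass(frozen=True)
class Document:
    doc_id: DocumentId
    title: str  # hypothetical metadata field
    text: str


@dataclass
class Block:
    documents: List[Document] = field(default_factory=list)


def split_into_blocks(documents: Iterable[Document], max_chars: int) -> Iterator[Block]:
    """Group documents into blocks of at most max_chars of text each.

    A single document longer than max_chars still gets a block of its own.
    """
    block, size = Block(), 0
    for doc in documents:
        # Close the current block if adding this document would exceed the limit.
        if block.documents and size + len(doc.text) > max_chars:
            yield block
            block, size = Block(), 0
        block.documents.append(doc)
        size += len(doc.text)
    if block.documents:
        yield block
```

Under these assumptions a corpus is simply the sequence of blocks produced by `split_into_blocks`, so each block can be loaded and processed independently of the others.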

InformationRetrieval

The project defines a number of types for processing text documents organized in a corpus.

  • Transformer - converts a corpus of documents into a new corpus, preserving its structure but changing the representation of the texts (parsing, cleaning, tokenization, etc.).
  • Indexer - builds an index from a corpus.
  • Token - tuple of a term, a document id and the term's position in the document.
  • BuildableIndex - type used to build an index from a list of tokens; it creates a SearchableIndex.
  • SearchableIndex - supports searching for a term in the corpus.
  • Boolean Search Engine - performs text search in the corpus using the index. Supports the AND, OR and NOT operators.
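
How these pieces might fit together can be sketched as a small inverted index. This is an illustration only: the type names follow the list above, but every method (`add`, `build`, `search`), the `tokenize` helper, and the use of Python sets for the boolean operators are assumptions rather than the project's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Token(NamedTuple):
    term: str
    doc_id: int
    position: int


def tokenize(doc_id: int, text: str) -> List[Token]:
    """Hypothetical transformer step: lowercase the text and split on whitespace."""
    return [Token(word, doc_id, pos) for pos, word in enumerate(text.lower().split())]


class SearchableIndex:
    """Read-only index: term -> {doc_id -> positions}."""

    def __init__(self, postings: Dict[str, Dict[int, List[int]]], all_docs: Set[int]):
        self._postings = postings
        self.all_docs = all_docs  # needed to evaluate NOT

    def search(self, term: str) -> Set[int]:
        return set(self._postings.get(term.lower(), {}))


class BuildableIndex:
    """Accumulates tokens and produces a SearchableIndex."""

    def __init__(self) -> None:
        self._postings = defaultdict(lambda: defaultdict(list))
        self._docs: Set[int] = set()

    def add(self, tokens: Iterable[Token]) -> None:
        for token in tokens:
            self._postings[token.term][token.doc_id].append(token.position)
            self._docs.add(token.doc_id)

    def build(self) -> SearchableIndex:
        frozen = {term: dict(docs) for term, docs in self._postings.items()}
        return SearchableIndex(frozen, set(self._docs))


builder = BuildableIndex()
builder.add(tokenize(1, "the quick brown fox"))
builder.add(tokenize(2, "the lazy dog"))
builder.add(tokenize(3, "quick dog"))
index = builder.build()

# Boolean operators expressed as set operations on document ids.
quick_and_dog = index.search("quick") & index.search("dog")  # AND
quick_or_lazy = index.search("quick") | index.search("lazy")  # OR
not_the = index.all_docs - index.search("the")  # NOT
```

Expressing AND, OR and NOT as set intersection, union and difference keeps the sketch short; a real engine would typically merge sorted posting lists instead, which avoids materializing the full set of document ids for NOT.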

Wikidump

A set of types for building a corpus from a Wikipedia dump.
