pms-search/FullTextSearch

 
 

PMS Full-Text Search Engine for .NET Core

License: MIT

A full-text search engine written in C# for .NET Core, with no external dependencies.

The aim of this project is to showcase the algorithms, data structures and techniques used to build full-text search engines.

Getting Started

On Windows:

  1. Download and build the code using the following commands:

    dotnet restore
    dotnet build
  2. Open the folder with the binaries: bin\Debug\netcoreapp2.0

  3. Run the following command, replacing DATA_PATH with the path to the Datasets folder:

    run_test.bat DATA_PATH
  4. If everything goes well, the following messages are printed:

    Log from index construction:

    dotnet Protsyk.PMS.FullText.ConsoleUtil.dll index "F:\Sources\FullTextSearch\Datasets"
    
    PMS Full-Text Search (c) Petro Protsyk 2017-2018
    F:\Sources\FullTextSearch\Datasets\Simple\TestFile001.txt
    F:\Sources\FullTextSearch\Datasets\Simple\TestFile002.txt
    F:\Sources\FullTextSearch\Datasets\Simple\TestFile003.txt
    Indexed documents: 3, time: 00:00:00.1010004

    Dump of the index (for each term in the dictionary, the list of all its occurrences):

    dotnet Protsyk.PMS.FullText.ConsoleUtil.dll print
    
    PMS Full-Text Search (c) Petro Protsyk 2017-2018
    2017 -> [1,1,9]
    algorithms -> [1,1,19]
    and -> [1,1,20]
    apple -> [3,1,1]
    banana -> [3,1,2]
    build -> [1,1,25]
    c -> [1,1,16]
    data -> [1,1,21]
    demonstrate -> [1,1,18]
    ...

    Search with query WORD(pms):

    dotnet Protsyk.PMS.FullText.ConsoleUtil.dll search "WORD(pms)"
    
    {filename:"TestFile001.txt", size:"180", created:"2018-04-02T10:09:41.4208444+02:00"}
    {[1,1,1]}
    
    {filename:"TestFile002.txt", size:"29", created:"2018-04-02T10:09:41.4248447+02:00"}
    {[2,1,1]}
    
    Documents found: 2, matches: 2, time: 00:00:00.0564721

    Dictionary lookup using a pattern, which lists all terms that match the pattern:

    dotnet Protsyk.PMS.FullText.ConsoleUtil.dll lookup pet*
    petro-mariya-sophie
    Terms found: 1, time: 00:00:00.0704173
    
    dotnet Protsyk.PMS.FullText.ConsoleUtil.dll lookup projct~1
    project
    Terms found: 1, time: 00:00:00.0847931
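
The index dump in step 4 lists, for each term, all of its occurrences. A minimal in-memory sketch of such an inverted index follows. It assumes each triple reads as [document, field, position], which the sample output suggests but this README does not state. `TinyIndex`, `Build`, and the naive tokenizer are hypothetical names for illustration and are not taken from the project's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TinyIndex
{
    // Builds term -> occurrences. Each occurrence is (document, field, position), all 1-based.
    // Assumption: every document has a single field, numbered 1.
    public static SortedDictionary<string, List<(int Doc, int Field, int Pos)>> Build(
        IReadOnlyList<string> documents)
    {
        var index = new SortedDictionary<string, List<(int Doc, int Field, int Pos)>>(
            StringComparer.Ordinal);

        for (int d = 0; d < documents.Count; d++)
        {
            // Naive tokenizer: lowercase, then split on anything that is not a letter or digit.
            var normalized = new string(documents[d]
                .ToLowerInvariant()
                .Select(ch => char.IsLetterOrDigit(ch) ? ch : ' ')
                .ToArray());
            var tokens = normalized.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

            for (int p = 0; p < tokens.Length; p++)
            {
                if (!index.TryGetValue(tokens[p], out var occurrences))
                {
                    occurrences = new List<(int Doc, int Field, int Pos)>();
                    index[tokens[p]] = occurrences;
                }
                occurrences.Add((d + 1, 1, p + 1));
            }
        }
        return index;
    }
}
```

Iterating the returned dictionary yields terms in sorted order, which mirrors the layout of the dump. The real engine uses a persistent index, and its tokenizer evidently keeps hyphenated terms such as petro-mariya-sophie whole, which this sketch does not.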

Query Language

  • WORD(apple) - single word
  • WILD(app*) - wildcard pattern
  • EDIT(apple, 1) - fuzzy search: terms within Levenshtein (edit) distance 1 of apple
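
To make the EDIT primitive concrete, here is a minimal sketch of the Levenshtein distance computed with the textbook dynamic-programming recurrence. `EditDistance`, `Levenshtein`, and `Matches` are hypothetical names. The engine may match fuzzy terms against its dictionary in a different and more efficient way.

```csharp
using System;

public static class EditDistance
{
    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b. Uses two rolling rows.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];

        // Turning the empty prefix of a into the first j characters of b costs j insertions.
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            // The row just computed becomes the previous row for the next character.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // True when term is within maxDistance edits of query, as in EDIT(query, maxDistance).
    public static bool Matches(string term, string query, int maxDistance) =>
        Levenshtein(term, query) <= maxDistance;
}
```

For example, `EditDistance.Matches("project", "projct", 1)` is true because a single insertion of `e` turns projct into project.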

Conjunction operators

  • OR - boolean or
  • AND - boolean and
  • SEQ - sequence of consecutive words (phrase)

Examples of queries:

  • AND(WORD(apple), OR(WILD(a*), EDIT(apple, 1)))
  • SEQ(WORD(hello), WORD(world))
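
A rough sketch of how the three operators could combine the occurrence lists of two sub-queries is shown below. It assumes an occurrence is reduced to a (document, position) pair and that SEQ means the second word directly follows the first. The `Operators` class and its methods are hypothetical and do not reflect the engine's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Operators
{
    // OR: documents that contain at least one of the operands.
    public static SortedSet<int> Or(
        IEnumerable<(int Doc, int Pos)> left, IEnumerable<(int Doc, int Pos)> right) =>
        new SortedSet<int>(left.Concat(right).Select(o => o.Doc));

    // AND: documents that contain both operands.
    public static SortedSet<int> And(
        IEnumerable<(int Doc, int Pos)> left, IEnumerable<(int Doc, int Pos)> right)
    {
        var docs = new SortedSet<int>(left.Select(o => o.Doc));
        docs.IntersectWith(right.Select(o => o.Doc));
        return docs;
    }

    // SEQ: occurrences of the first word that are immediately followed
    // by the second word in the same document.
    public static List<(int Doc, int Pos)> Seq(
        IEnumerable<(int Doc, int Pos)> first, IEnumerable<(int Doc, int Pos)> second)
    {
        var followers = new HashSet<(int Doc, int Pos)>(second);
        return first.Where(o => followers.Contains((o.Doc, o.Pos + 1))).ToList();
    }
}
```

With `hello` occurring at (1,1) and (2,5) and `world` at (1,2) and (3,1), SEQ yields the single match (1,1), AND yields document 1, and OR yields documents 1, 2 and 3.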

Data Structures

  • The dictionary of the persistent index is implemented as a ternary search tree.
  • Key-value storage for document metadata is based on a persistent B-tree implementation.
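
For readers unfamiliar with the structure, the following is a minimal in-memory ternary search tree with exact lookup and prefix enumeration (the kind of traversal a pattern such as pet* needs). It is only a sketch: the class and member names are hypothetical, and the project's dictionary is persistent, which this example does not attempt to model.

```csharp
using System;
using System.Collections.Generic;

public sealed class TernarySearchTree
{
    // Each node holds one character and three links: smaller, next character, larger.
    private sealed class Node
    {
        public readonly char Ch;
        public bool IsTerm;
        public Node Lo, Eq, Hi;
        public Node(char ch) { Ch = ch; }
    }

    private Node root;

    public void Add(string term)
    {
        if (string.IsNullOrEmpty(term))
            throw new ArgumentException("Term must be non-empty.", nameof(term));
        root = Add(root, term, 0);
    }

    private static Node Add(Node node, string term, int i)
    {
        char c = term[i];
        if (node == null) node = new Node(c);
        if (c < node.Ch) node.Lo = Add(node.Lo, term, i);
        else if (c > node.Ch) node.Hi = Add(node.Hi, term, i);
        else if (i + 1 < term.Length) node.Eq = Add(node.Eq, term, i + 1);
        else node.IsTerm = true;
        return node;
    }

    public bool Contains(string term)
    {
        if (string.IsNullOrEmpty(term)) return false;
        var node = Find(root, term);
        return node != null && node.IsTerm;
    }

    // Returns all stored terms that start with prefix, in ordinal sorted order.
    public List<string> WithPrefix(string prefix)
    {
        var result = new List<string>();
        if (string.IsNullOrEmpty(prefix)) { Collect(root, "", result); return result; }
        var node = Find(root, prefix);
        if (node == null) return result;
        if (node.IsTerm) result.Add(prefix);
        Collect(node.Eq, prefix, result);
        return result;
    }

    // Walks to the node holding the last character of key, or returns null.
    private static Node Find(Node node, string key)
    {
        int i = 0;
        while (node != null)
        {
            char c = key[i];
            if (c < node.Ch) node = node.Lo;
            else if (c > node.Ch) node = node.Hi;
            else if (i + 1 < key.Length) { node = node.Eq; i++; }
            else return node;
        }
        return null;
    }

    // In-order traversal: smaller characters, this character, larger characters.
    private static void Collect(Node node, string prefix, List<string> result)
    {
        if (node == null) return;
        Collect(node.Lo, prefix, result);
        string current = prefix + node.Ch;
        if (node.IsTerm) result.Add(current);
        Collect(node.Eq, current, result);
        Collect(node.Hi, prefix, result);
    }
}
```

A ternary search tree stores one character per node like a trie, but branches with binary-search-style comparisons, so it needs no per-node child table and still supports ordered prefix scans.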

Algorithms

References

Links
