Simple text file indexing library
PM> Install-Package Photosphere.SearchEngine
The library provides word-level indexing of text files.
Features:
- builds a search index over files and directories;
- adds and removes files and directories to and from the index without locking searches;
- searches files for a whole word or a prefix;
- monitors files and directories for changes and promptly updates the index in accordance with them.
The library ships with a small desktop app that demonstrates all of its features.
The main component of the library is the SearchEngine class, which provides the following functionality:
- a method for adding a path to the index;
- a method for removing a path from the index;
- a method for searching for the set of word/prefix entries in files;
- a method for searching for files by word or prefix;
- a set of events raised when file indexing, removing, or updating begins or ends, or when a file path changes.
The search index is an inverted index, i.e. a map from a word to the set of entries in files that contain this word. The main data structure is a compressed prefix tree (PATRICIA trie), which keeps the expected search time proportional to the length of the search query.
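The idea of a trie-backed inverted index can be sketched as follows. This is only an illustration under stated assumptions, not the library's implementation: it uses a plain (uncompressed) trie instead of a PATRICIA trie, it is not thread-safe, and the names TrieIndex and WordEntry are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entry type: where a word occurs.
public sealed class WordEntry
{
    public WordEntry(string filePath, int position)
    {
        FilePath = filePath;
        Position = position;
    }

    public string FilePath { get; }
    public int Position { get; }
}

// Illustrative uncompressed trie; a PATRICIA trie would merge single-child chains.
public sealed class TrieIndex
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly List<WordEntry> Entries = new List<WordEntry>();
    }

    private readonly Node _root = new Node();

    // Walks one node per character, creating nodes as needed.
    public void Add(string word, WordEntry entry)
    {
        var node = _root;
        foreach (var ch in word)
        {
            if (!node.Children.TryGetValue(ch, out var next))
            {
                next = new Node();
                node.Children.Add(ch, next);
            }
            node = next;
        }
        node.Entries.Add(entry);
    }

    // Reaching the query node takes time proportional to the query length;
    // collecting prefix matches additionally visits the matching subtree.
    public List<WordEntry> Search(string query, bool wholeWord = false)
    {
        var result = new List<WordEntry>();
        var node = _root;
        foreach (var ch in query)
        {
            if (!node.Children.TryGetValue(ch, out node))
            {
                return result; // no word starts with this query
            }
        }
        if (wholeWord)
        {
            result.AddRange(node.Entries);
            return result;
        }
        var stack = new Stack<Node>();
        stack.Push(node);
        while (stack.Count > 0)
        {
            var current = stack.Pop();
            result.AddRange(current.Entries);
            foreach (var child in current.Children.Values)
            {
                stack.Push(child);
            }
        }
        return result;
    }
}
```

In this sketch the wholeWord flag decides whether only the entries stored at the query node are returned or the entries of the whole subtree below it.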
A direct (forward) index, a map from a file to the set of words in this file, is also built so that a file can be removed from the index quickly. This is needed, for example, when a file is deleted from the file system: the file is removed from the index after the fact (on an event), so the list of keys to delete from the index is not otherwise available. The direct index provides that list, which avoids a full index scan.
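A minimal sketch of how a direct index helps with removal, assuming a simplified inverted index that maps a word to a set of file paths. The class name TwoWayIndex and its members are hypothetical and do not come from the library.

```csharp
using System;
using System.Collections.Generic;

// Illustrative pair of indexes: word -> files (inverted) and file -> words (direct).
public sealed class TwoWayIndex
{
    private readonly Dictionary<string, HashSet<string>> _filesByWord =
        new Dictionary<string, HashSet<string>>();
    private readonly Dictionary<string, HashSet<string>> _wordsByFile =
        new Dictionary<string, HashSet<string>>();

    public void Add(string filePath, IEnumerable<string> words)
    {
        if (!_wordsByFile.TryGetValue(filePath, out var fileWords))
        {
            fileWords = new HashSet<string>();
            _wordsByFile.Add(filePath, fileWords);
        }
        foreach (var word in words)
        {
            fileWords.Add(word);
            if (!_filesByWord.TryGetValue(word, out var files))
            {
                files = new HashSet<string>();
                _filesByWord.Add(word, files);
            }
            files.Add(filePath);
        }
    }

    // The file may already be gone from disk, so its words cannot be re-read;
    // the direct index supplies exactly the keys that must be touched.
    public bool Remove(string filePath)
    {
        if (!_wordsByFile.TryGetValue(filePath, out var fileWords))
        {
            return false;
        }
        foreach (var word in fileWords)
        {
            var files = _filesByWord[word];
            files.Remove(filePath);
            if (files.Count == 0)
            {
                _filesByWord.Remove(word);
            }
        }
        _wordsByFile.Remove(filePath);
        return true;
    }

    public IReadOnlyCollection<string> Search(string word)
    {
        return _filesByWord.TryGetValue(word, out var files)
            ? (IReadOnlyCollection<string>)files
            : Array.Empty<string>();
    }
}
```

Remove visits only the words of the removed file, so its cost depends on the size of that file's vocabulary rather than on the size of the whole index.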
Files are registered in the search system by version. A file version is a composite of the file path, the last write date, and the creation date.
The contents of added paths are monitored for file and directory creation, removal, and renaming. A change to the contents of an indexed file causes the file to be reindexed. Adding or removing a file adds it to or removes it from the index. Renaming files and directories affects only file paths.
Removing a file from the index only marks the outdated file versions as dead. The actual index cleanup is performed in the background.
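The mark-then-clean approach can be sketched like this. It is a single-threaded illustration under stated assumptions: the FileVersion and LazyIndex types are hypothetical, and CleanUp is called directly here, whereas a real engine would presumably run it on a timer or a background thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical file version: path plus last write and creation dates.
public sealed class FileVersion
{
    public FileVersion(string path, DateTime lastWriteUtc, DateTime creationUtc)
    {
        Path = path;
        LastWriteUtc = lastWriteUtc;
        CreationUtc = creationUtc;
    }

    public string Path { get; }
    public DateTime LastWriteUtc { get; }
    public DateTime CreationUtc { get; }
    public bool IsDead { get; private set; }

    // Removal is just a flag flip; no index structure is modified.
    public void MarkDead() => IsDead = true;
}

public sealed class LazyIndex
{
    private readonly Dictionary<string, List<FileVersion>> _versionsByWord =
        new Dictionary<string, List<FileVersion>>();

    public void Add(string word, FileVersion version)
    {
        if (!_versionsByWord.TryGetValue(word, out var versions))
        {
            versions = new List<FileVersion>();
            _versionsByWord.Add(word, versions);
        }
        versions.Add(version);
    }

    // Searches skip dead versions, so a removed file disappears immediately.
    public List<FileVersion> Search(string word)
    {
        return _versionsByWord.TryGetValue(word, out var versions)
            ? versions.Where(v => !v.IsDead).ToList()
            : new List<FileVersion>();
    }

    // Number of versions physically stored, dead ones included.
    public int StoredCount => _versionsByWord.Values.Sum(v => v.Count);

    // Physically drops dead versions and empty keys; returns how many were dropped.
    public int CleanUp()
    {
        var removed = 0;
        foreach (var word in _versionsByWord.Keys.ToList())
        {
            var versions = _versionsByWord[word];
            removed += versions.RemoveAll(v => v.IsDead);
            if (versions.Count == 0)
            {
                _versionsByWord.Remove(word);
            }
        }
        return removed;
    }
}
```

The trade-off in this sketch is that removal is cheap and never restructures the index, at the cost of dead entries occupying memory until the next cleanup pass.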
- Best suited for implementing a search box with autocomplete over an ordinary document base.
- Best suited for files that are not large.
- The unit of indexing is a file.
- The number of files does not matter.
Contains three projects:
Photosphere.SearchEngine
Photosphere.SearchEngine.IntegrationTests
Photosphere.SearchEngine.DemoApp
Based on .NET Framework 4.7.
NuGet packages:
- System.Runtime.CompilerServices.Unsafe: needed for NonBlocking.ConcurrentDictionary;
- UDE.CSharp: a .NET port of the Mozilla Universal Charset Detector, a tool for detecting file encoding.
Vendored code that is not convenient to use as NuGet packages:
- https://github.com/VSadov/NonBlocking: lock-free implementation of ConcurrentDictionary;
- https://github.com/khalidsalomao/SimpleHelpers.Net: convenient wrapper for UDE.CSharp;
- https://github.com/Microsoft/vscode-filewatcher-windows: mechanism for consolidating FileSystemWatcher events from Visual Studio Code (used partially).
The main object is an instance of the SearchEngine class, which provides all the needed functionality.
var searchEngine = SearchEngineFactory.New();
or
var settings = new SearchEngineSettings();
var searchEngine = SearchEngineFactory.New(settings);
The SearchEngineSettings settings object has the following options:
- SupportedFilesExtensions: set of file extensions in lowercase; by default: txt, log, cs, js, fs, css, sql;
- FileParsers: set of parsers, which you can implement yourself;
- GcCollect: flag that controls forced garbage collection after index cleaning;
- CleaUpIntervalMs: double value that determines the index cleaning interval in milliseconds.
var isAdded = searchEngine.Add(pathToFolderOrFile);
var isRemoved = searchEngine.Remove(pathToFolderOrFile);
searchEngine.FileIndexingStarted += args => Console.WriteLine($"File {args.Path} indexing is started");
searchEngine.FileIndexingEnded += args => Console.WriteLine($"File {args.Path} indexing is ended");
searchEngine.FileRemovingStarted += args => Console.WriteLine($"File {args.Path} removing is started");
searchEngine.FileRemovingEnded += args => Console.WriteLine($"File {args.Path} removing is ended");
searchEngine.FileUpdateInitiated += args => Console.WriteLine($"File {args.Path} update is started");
searchEngine.FilePathChanged += args => Console.WriteLine($"File {args.Path} path is changed");
searchEngine.PathWatchingStarted += args => Console.WriteLine($"Path {args.Path} added to watcher");
searchEngine.PathWatchingEnded += args => Console.WriteLine($"Path {args.Path} removed from watcher");
searchEngine.FileUpdateFailed += args => Console.WriteLine($"Update of {args.Path} failed: {args.Error.Message}");
searchEngine.IndexCleanUpFailed += args => Console.WriteLine($"Index clean up failed: {args.Error.Message}");
searchEngine.Search("foo"); // returns all entries of words that start with the prefix "foo"
searchEngine.Search("foo", wholeWord: true); // returns all entries of the word "foo"
searchEngine.SearchFiles("foo"); // returns all files containing words that start with the prefix "foo"
searchEngine.SearchFiles("foo", wholeWord: true); // returns all files containing the word "foo"
For example, a custom parser for cs files:
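using System.Collections.Generic;
using System.Text;

public class CsFileParser : IFileParser
{
    private static readonly string[] FilesExts = { "cs" };

    // Extensions (lowercase, without a dot) handled by this parser
    public IEnumerable<string> FileExtensions => FilesExts;

    public IEnumerable<ParsedWord> Parse(IFileVersion fileVersion, Encoding encoding = null)
    {
        // parse your file here; fileVersion.Path contains the path to the actual file
        yield break; // placeholder so the stub compiles; replace with real parsing
    }
}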
and pass it to the settings object:
var settings = new SearchEngineSettings
{
    FileParsers = new[] { new CsFileParser() }
};
var searchEngine = SearchEngineFactory.New(settings);
CsFileParser will then be used to parse cs files instead of the standard parser.
A simple two-panel window. The left panel shows the directory tree. The right panel shows the search box with its options and the results. Search options:
- Only files: search for files rather than for word/prefix entries;
- Whole word: search by whole word rather than by prefix.