RunDevelopment/RegexRetrieval

RegexRetrieval

This is a fun little C# project providing a CLI and library for fast searches on a fixed word list using a regex-like query syntax. The resulting list of words is guaranteed to be in the original order.

The algorithms performing the search are called Retrievers. There are two retrievers currently implemented:

  1. Array retriever:
    The simplest possible retriever. Given a query, it generates a regular expression and matches it against every word.
  2. Regex retriever:
    This one is a bit smarter. It will try to narrow down the search range by analyzing the query and using Matchers to make a pre-selection, filtering out words which can't match the query.
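
To make the Array retriever concrete, here is a minimal sketch in Python (an illustration, not the project's C# code). It assumes a query syntax inferred from the examples in this README: `?` stands for exactly one character, `*` for any run of characters, and `[...]` for one character out of a set. The names `query_to_regex` and `array_retrieve` are hypothetical.

```python
import re

def query_to_regex(query):
    """Translate a query into a regular expression (assumed syntax, see above)."""
    parts = []
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == "?":
            parts.append(".")            # exactly one character
        elif ch == "*":
            parts.append(".*")           # any run of characters, possibly empty
        elif ch == "[":
            end = query.index("]", i)    # assumes the class is closed and non-empty
            parts.append("[" + re.escape(query[i + 1:end]) + "]")
            i = end
        else:
            parts.append(re.escape(ch))  # literal character
        i += 1
    return re.compile("".join(parts), re.DOTALL)

def array_retrieve(words, query):
    """Match the query against every word, keeping the original order."""
    pattern = query_to_regex(query)
    return [w for w in words if pattern.fullmatch(w)]
```

Because the list comprehension walks the word list front to back, the result is automatically in the original order, at the cost of one regex match per word.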

Regex retriever

This section will describe the inner workings of the Regex Retriever, so get ready for some details!

The Regex retriever employs a number of Matchers to narrow the search range, and a word index to optimize queries which can only match a small number of words.

A Matcher is an algorithm which, given some criteria, returns a list of words that match the criteria. The returned list is guaranteed to...

  1. ...be a subset of the fixed word list on which to perform search operations.
  2. ...be in the order of the fixed word list.
  3. ...contain all elements matching the criteria.

Such a list is called a selection.

Note: A selection might still contain some elements of the fixed word list which do not match the criteria. If all elements in the selection match the criteria, the selection is called minimal.

Note: Selections are usually not collections of strings but collections of integers, where each integer is the position of a word in the fixed word list. This integer representation has several advantages; the main ones are memory efficiency and the simplicity of implementing set operations on such collections.
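
As an illustration of why the integer representation is convenient, the following Python sketch intersects two selections with a single merge pass. It assumes that selections are stored as ascending lists of positions; the README does not state the actual storage format, and the names `intersect` and `materialize` are hypothetical.

```python
def intersect(a, b):
    """Intersect two selections given as ascending lists of word positions."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def materialize(words, selection):
    """Turn a selection back into words; the order of the word list is kept."""
    return [words[pos] for pos in selection]
```

Since both inputs are ascending, the output is ascending too, so the "original order" guarantee survives the set operation without any sorting.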

Word index

The word index is not a matcher as it does not return a selection. Instead, it is a simple hash map mapping each word of the fixed word list to its position. It is used to quickly check all possible words of queries which can only match a small finite number of words.

Given such a query, all possible words are generated and individually checked using the word index.

Example: The query [gs]et can only match get and set.
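
A rough Python sketch of this lookup follows. It assumes that a "finite" query consists only of literal characters and character classes such as `[gs]`, and that the word list contains no duplicates. The function names are hypothetical and do not come from the library.

```python
from itertools import product

def build_word_index(words):
    """Map each word to its position in the fixed word list (assumes unique words)."""
    return {word: pos for pos, word in enumerate(words)}

def expand_finite_query(query):
    """List every word a query made of literals and [...] classes can match."""
    options = []
    i = 0
    while i < len(query):
        if query[i] == "[":
            end = query.index("]", i)
            options.append(sorted(set(query[i + 1:end])))
            i = end + 1
        else:
            options.append([query[i]])
            i += 1
    return ["".join(combo) for combo in product(*options)]

def lookup(word_index, query):
    """Check every candidate word against the index; return ascending positions."""
    candidates = expand_finite_query(query)
    return sorted({word_index[c] for c in candidates if c in word_index})
```

For `[gs]et`, only two hash lookups are needed, regardless of how large the word list is.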

Length matcher

This simple matcher will return a minimal selection based on word length.

All length intervals of the form [a, b] with 0 <= a <= b are supported.

If a query only contains placeholders, the result of this matcher will be the final result.

Example: The query ?e? can only match words of length 3 while ??* matches all words with length >= 2.
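
The following Python sketch shows one way such a matcher could work; it is an assumption-laden illustration, not the library's implementation. `length_interval` derives the interval [a, b] from a query (using the query syntax inferred from the examples in this README), and the hypothetical `LengthMatcher` class buckets word positions by length.

```python
import math

def length_interval(query):
    """Return (a, b) so that every matching word has a <= length <= b."""
    min_len = 0
    unbounded = False
    i = 0
    while i < len(query):
        if query[i] == "*":
            unbounded = True                  # '*' adds zero or more characters
        else:
            if query[i] == "[":
                i = query.index("]", i)       # a whole class counts as one character
            min_len += 1
        i += 1
    return min_len, (math.inf if unbounded else min_len)

class LengthMatcher:
    def __init__(self, words):
        # length -> ascending list of positions of words with that length
        self.by_length = {}
        for pos, word in enumerate(words):
            self.by_length.setdefault(len(word), []).append(pos)

    def select(self, a, b):
        """Return the minimal selection of words with a <= length <= b."""
        hits = []
        for length, positions in self.by_length.items():
            if a <= length <= b:
                hits.extend(positions)
        return sorted(hits)
```

The selection is minimal because every returned position belongs to a word whose length lies in the interval.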

Substring matcher (SSM)

Given strings that must occur as substrings of every word matching the query, this matcher returns a selection.

If a given string contains a character which no word contains, the empty selection will be returned.

Example: All words which match the query *ll* must have ll as a substring.
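
This README does not describe how the SSM is implemented, so the Python sketch below is only a deliberately coarse stand-in: it indexes words by the characters they contain and intersects the per-character sets. The result is a valid selection (it contains every word that has the required substrings) but it is generally not minimal. The class name `CharSubstringMatcher` is hypothetical.

```python
class CharSubstringMatcher:
    def __init__(self, words):
        self.size = len(words)
        # character -> set of positions of words containing that character
        self.by_char = {}
        for pos, word in enumerate(words):
            for ch in set(word):
                self.by_char.setdefault(ch, set()).add(pos)

    def select(self, substrings):
        """Return positions of all words that contain every character
        of every required substring (a superset of the true matches)."""
        result = None
        for s in substrings:
            for ch in set(s):
                hits = self.by_char.get(ch)
                if hits is None:
                    return []                 # no word contains this character
                result = hits if result is None else result & hits
        if result is None:                    # no constraints at all
            return list(range(self.size))
        return sorted(result)
```

With the words `hello`, `world`, `tell`, and `hat`, the query `*ll*` yields the positions of `hello`, `world`, and `tell`: `world` is a false positive that a later exact check has to remove.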

Positional substring matcher (PSSM)

Given a list of positional substrings (PSSs) which must be PSSs of a word to match the query, it will return a selection.

The basic idea is that we know the position or a range of positions for each substring. For example, the substring abc of the query abc* has to be at the start of every word matching the query. This drastically reduces the number of words which have to be tested compared to the SSM.

A PSS is a substring together with the range of positions in which it has to occur. Formally, a PSS is a tuple (S, L, R) where S is the substring, and L and R are intervals constraining the number of characters to the left and to the right of S.
(S, L, R) is a PSS of a word w iff there exists a position i in w for which w.Substring(i, S.Length) == S and i∈L and (w.Length - S.Length - i)∈R.
Example: ("abc", [0,2], [1,1]) is a PSS of "abcd", "xabcx", and "xxabcx" but not of "abc" or "abcxx".
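
The definition can be checked directly with a brute-force predicate. The Python sketch below is only meant to pin down the definition, not to show how the PSSM avoids scanning every word; the function name `is_pss` is hypothetical, and intervals are represented as inclusive `(low, high)` pairs.

```python
def is_pss(word, s, left, right):
    """Return True if (s, left, right) is a PSS of word.

    left and right are inclusive (low, high) intervals for the number of
    characters to the left and to the right of s.
    """
    for i in range(len(word) - len(s) + 1):
        if word[i:i + len(s)] != s:
            continue
        chars_right = len(word) - len(s) - i
        if left[0] <= i <= left[1] and right[0] <= chars_right <= right[1]:
            return True
    return False

# The example from the text: ("abc", [0,2], [1,1])
matching = [w for w in ["abcd", "xabcx", "xxabcx", "abc", "abcxx"]
            if is_pss(w, "abc", (0, 2), (1, 1))]
# matching == ["abcd", "xabcx", "xxabcx"]
```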

Because there are a lot of possible PSSs of one substring, the PSSM only looks at PSSs for which L or R is fixed, meaning that the interval contains exactly one number. To simplify even more, there are two variants of the PSSM: a left to right (LTR) variant which only handles PSSs with fixed L, and a right to left (RTL) variant which only handles PSSs with fixed R.

The LTR PSSM is implemented as a list of substring tries where the (zero-based) list index of a substring trie is the position of the substring. The value of each node of a substring trie is the selection of words which have the string encoded by the path to that node as a substring at the position given by the list index.
Given a PSS with fixed L, the PSSM chooses the substring trie whose index is in L. This trie is then used to get the most specific selection for the PSS.

Note: An RTL PSSM is implemented as an LTR PSSM which is constructed on the list of words for which the characters of every word are in reversed order. The substrings of the PSSs given to the RTL PSSM also have to be reversed.
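
The structure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the library's C# code: the class names `TrieNode`, `LtrPssm`, and `RtlPssm` are hypothetical, and the `max_depth` cap on trie depth is an assumption added here to keep the sketch small.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.selection = []   # ascending positions of words reaching this node

class LtrPssm:
    def __init__(self, words, max_depth=4):
        self.max_depth = max_depth
        self.tries = []       # tries[i] covers substrings starting at index i
        for pos, word in enumerate(words):
            for start in range(len(word)):
                while len(self.tries) <= start:
                    self.tries.append(TrieNode())
                node = self.tries[start]
                for ch in word[start:start + max_depth]:
                    node = node.children.setdefault(ch, TrieNode())
                    node.selection.append(pos)

    def select(self, substring, offset):
        """Selection of words having `substring` at index `offset`.
        Substrings longer than max_depth give a non-minimal selection."""
        if not substring:
            raise ValueError("substring must not be empty")
        if offset >= len(self.tries):
            return []
        node = self.tries[offset]
        for ch in substring[:self.max_depth]:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.selection)

class RtlPssm:
    """RTL variant: an LTR PSSM built on the reversed words."""
    def __init__(self, words, max_depth=4):
        self.inner = LtrPssm([w[::-1] for w in words], max_depth)

    def select(self, substring, chars_right):
        return self.inner.select(substring[::-1], chars_right)
```

For a query like awe*some, one would ask the LTR variant for `"awe"` at offset 0 and the RTL variant for `"some"` with 0 characters to its right, then intersect the two selections.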

Example: All words which match the query awe*some must start with awe and end with some. More complex: All words which match the query ?e??*e? must have an e at index 1 and an e as the second last letter.

CLI

This project also comes with a small command line interface to debug and test the library. Build and run the RegexRetriever.CLI project to open the CLI.

Every input is interpreted as a command or query. Enter an empty input to exit the program.

For debugging purposes, the program will automatically load ($LOAD) a word list (a text file where each line is one word) and open the input for the retriever ($CREATE).

Commands

All commands start with a dollar sign $ followed by the command name (case sensitive) and a list of arguments separated by spaces.

$LOAD <path>

This command loads the word list with the given path.

Because a new retriever has to be constructed after loading a new word list, the $CREATE command will be automatically executed after the word list is loaded.

Example: $LOAD C:\path\to\file.txt

$CREATE [ <retriever> ]

This command is used to create a retriever.

If no argument is provided, another input will be opened prompting the user to input the retriever to create.

Retrievers may support additional creation options in the form of arguments. This is documented when executing this command with arguments.

Example: $CREATE array

$GC

This command takes no arguments.

Upon execution, it will invoke .NET's garbage collector to compact the heap as much as possible. It will print the consumed memory when finished.

Example: $GC

$TEST [ <query 1> [ <query 2> [ <query 3> [ ... ] ] ] ]

This command takes any number of queries and executes them, outputting the results as a markdown table along with execution time and other properties of the queries.

If no arguments are provided, the standard test cases will be executed. These queries are defined in TestCases.cs.

Example: $TEST abc* *??ly colo[u]r
