C# (CSharp) Lucene.Net.Analysis.Miscellaneous WordDelimiterFilter示例

编程语言: C# (CSharp)

命名空间/包名称: Lucene.Net.Analysis.Miscellaneous

hotexamples.com的示例: 18

C# (CSharp) Lucene.Net.Analysis.Miscellaneous WordDelimiterFilter - 已找到18个示例。这些是从开源项目中提取的最受好评的Lucene.Net.Analysis.Miscellaneous.WordDelimiterFilter现实C# (CSharp)示例。您可以评价示例，以帮助我们提高示例质量。

常用方法

显示隐藏

IsSubwordDelim(3)

IsAlpha(2)

ClearAttributes(1)

IsDigit(1)

IsUpper(1)

Position(1)

position(1)

Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:

split on intra-word delimiters (by default, all non alpha-numeric characters): "Wi-Fi" → "Wi", "Fi"
split on case transitions: "PowerShot" → "Power", "Shot"
split on letter-number transitions: "SD500" → "SD", "500"
leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"
trailing "'s" are removed for each subword: "O'Neil's" → "O", "Neil"
- Note: this step isn't performed in a separate filter because of possible subword combinations.

The combinations parameter affects how subwords are combined:

combinations="0" causes no subword combinations: "PowerShot" → 0:"Power", 1:"Shot" (0 and 1 are the token positions)
combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run:
- "PowerShot" → 0:"Power", 1:"Shot" 1:"PowerShot"
- "A's+B's&C's" -gt; 0:"A", 1:"B", 2:"C", 2:"ABC"
- "Super-Duper-XL500-42-AutoCoder!" → 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500" 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"

One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi" one may want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default) in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).

Inheritance: TokenFilter

WordDelimiterFilter Class Documentation

示例#1

显示文件

文件： WordDelimiterIterator.cs 项目： zhangbo27/lucenenet

        // ================================================= Helper Methods ================================================

        /// <summary>
        /// Determines whether the transition from lastType to type indicates a break
        /// </summary>
        /// <param name="lastType"> Last subword type </param>
        /// <param name="type"> Current subword type </param>
        /// <returns> <c>true</c> if the transition indicates a break, <c>false</c> otherwise </returns>
        private bool IsBreak(int lastType, int type)
        {
            if ((type & lastType) != 0)
            {
                return(false);
            }

            if (!splitOnCaseChange && WordDelimiterFilter.IsAlpha(lastType) && WordDelimiterFilter.IsAlpha(type))
            {
                // ALPHA->ALPHA: always ignore if case isn't considered.
                return(false);
            }
            else if (WordDelimiterFilter.IsUpper(lastType) && WordDelimiterFilter.IsAlpha(type))
            {
                // UPPER->letter: Don't split
                return(false);
            }
            else if (!splitOnNumerics && ((WordDelimiterFilter.IsAlpha(lastType) && WordDelimiterFilter.IsDigit(type)) || (WordDelimiterFilter.IsDigit(lastType) && WordDelimiterFilter.IsAlpha(type))))
            {
                // ALPHA->NUMERIC, NUMERIC->ALPHA :Don't split
                return(false);
            }

            return(true);
        }

示例#2

显示文件

文件： TestWordDelimiterFilter.cs 项目： wwb/lucenenet

        public virtual void doSplit(string input, params string[] output)
        {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE | WordDelimiterFilter.SPLIT_ON_NUMERICS | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE;
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new MockTokenizer(new StringReader(input), MockTokenizer.KEYWORD, false), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, output);
        }

示例#3

显示文件

文件： TestWordDelimiterFilter.cs 项目： ChristopherHaws/lucenenet

        public virtual void TestOffsetChange2()
        {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.CATENATE_ALL | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE | WordDelimiterFilter.SPLIT_ON_NUMERICS | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE;
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("(übelkeit", 7, 17)), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, new string[] { "übelkeit" }, new int[] { 8 }, new int[] { 17 });
        }

示例#4

显示文件

文件： TestWordDelimiterFilter.cs 项目： wwb/lucenenet

        public virtual void TestOffsetChange4()
        {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.CATENATE_ALL | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE | WordDelimiterFilter.SPLIT_ON_NUMERICS | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE;
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("(foo,bar)", 7, 16)), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, new string[] { "foo", "foobar", "bar" }, new int[] { 8, 8, 12 }, new int[] { 11, 15, 15 });
        }

示例#5

显示文件

文件： WordDelimiterIterator.cs 项目： zhangbo27/lucenenet

 /// <summary>
 /// Determines if the text at the given position indicates an English possessive which should be removed
 /// </summary>
 /// <param name="pos"> Position in the text to check if it indicates an English possessive </param>
 /// <returns> <c>true</c> if the text at the position indicates an English posessive, <c>false</c> otherwise </returns>
 private bool EndsWithPossessive(int pos)
 {
     return(stemEnglishPossessive &&
            pos > 2 &&
            text[pos - 2] == '\'' &&
            (text[pos - 1] == 's' || text[pos - 1] == 'S') &&
            WordDelimiterFilter.IsAlpha(CharType(text[pos - 3])) &&
            (pos == endBounds || WordDelimiterFilter.IsSubwordDelim(CharType(text[pos]))));
 }

示例#6

显示文件

文件： TestWordDelimiterFilter.cs 项目： wwb/lucenenet

        public virtual void doSplitPossessive(int stemPossessive, string input, params string[] output)
        {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE | WordDelimiterFilter.SPLIT_ON_NUMERICS;

            flags |= (stemPossessive == 1) ? WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE : 0;
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new MockTokenizer(new StringReader(input), MockTokenizer.KEYWORD, false), flags, null);

            AssertTokenStreamContents(wdf, output);
        }

示例#7

显示文件

文件： TestWordDelimiterFilter.cs 项目： zhuthree/lucenenet

        public virtual void TestOffsetChange2()
        {
            WordDelimiterFlags flags = WordDelimiterFlags.GENERATE_WORD_PARTS
                                       | WordDelimiterFlags.GENERATE_NUMBER_PARTS
                                       | WordDelimiterFlags.CATENATE_ALL
                                       | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE
                                       | WordDelimiterFlags.SPLIT_ON_NUMERICS
                                       | WordDelimiterFlags.STEM_ENGLISH_POSSESSIVE;
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("(übelkeit", 7, 17)), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, new string[] { "übelkeit" }, new int[] { 8 }, new int[] { 17 });
        }

示例#8

显示文件

文件： TestWordDelimiterFilter.cs 项目： wwb/lucenenet

        public virtual void TestOffsets()
        {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.CATENATE_ALL | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE | WordDelimiterFilter.SPLIT_ON_NUMERICS | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE;
            // test that subwords and catenated subwords have
            // the correct offsets.
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("foo-bar", 5, 12)), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, new string[] { "foo", "foobar", "bar" }, new int[] { 5, 5, 9 }, new int[] { 8, 12, 12 });

            wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("foo-bar", 5, 6)), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, new string[] { "foo", "bar", "foobar" }, new int[] { 5, 5, 5 }, new int[] { 6, 6, 6 });
        }

示例#9

显示文件

文件： TestWordDelimiterFilter.cs 项目： ChristopherHaws/lucenenet

        public virtual void TestOffsets()
        {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.CATENATE_ALL | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE | WordDelimiterFilter.SPLIT_ON_NUMERICS | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE;
            // test that subwords and catenated subwords have
            // the correct offsets.
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("foo-bar", 5, 12)), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, new string[] { "foo", "foobar", "bar" }, new int[] { 5, 5, 9 }, new int[] { 8, 12, 12 });

            wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new SingleTokenTokenStream(new Token("foo-bar", 5, 6)), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, new string[] { "foo", "bar", "foobar" }, new int[] { 5, 5, 5 }, new int[] { 6, 6, 6 });
        }

示例#10

显示文件

文件： WordDelimiterIterator.cs 项目： zhangbo27/lucenenet

        /// <summary>
        /// Set the internal word bounds (remove leading and trailing delimiters). Note, if a possessive is found, don't remove
        /// it yet, simply note it.
        /// </summary>
        private void SetBounds()
        {
            while (startBounds < length && (WordDelimiterFilter.IsSubwordDelim(CharType(text[startBounds]))))
            {
                startBounds++;
            }

            while (endBounds > startBounds && (WordDelimiterFilter.IsSubwordDelim(CharType(text[endBounds - 1]))))
            {
                endBounds--;
            }
            if (EndsWithPossessive(endBounds))
            {
                hasFinalPossessive = true;
            }
            current = startBounds;
        }

示例#11

显示文件

文件： WordDelimiterIterator.cs 项目： zhangbo27/lucenenet

        /// <summary>
        /// Advance to the next subword in the string.
        /// </summary>
        /// <returns> index of the next subword, or <see cref="DONE"/> if all subwords have been returned </returns>
        internal int Next()
        {
            current = end;
            if (current == DONE)
            {
                return(DONE);
            }

            if (skipPossessive)
            {
                current       += 2;
                skipPossessive = false;
            }

            int lastType = 0;

            while (current < endBounds && (WordDelimiterFilter.IsSubwordDelim(lastType = CharType(text[current]))))
            {
                current++;
            }

            if (current >= endBounds)
            {
                return(end = DONE);
            }

            for (end = current + 1; end < endBounds; end++)
            {
                int type = CharType(text[end]);
                if (IsBreak(lastType, type))
                {
                    break;
                }
                lastType = type;
            }

            if (end < endBounds - 1 && EndsWithPossessive(end + 2))
            {
                skipPossessive = true;
            }

            return(end);
        }

示例#12

显示文件

文件： TestWordDelimiterFilter.cs 项目： ChristopherHaws/lucenenet

        public virtual void doSplitPossessive(int stemPossessive, string input, params string[] output)
        {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE | WordDelimiterFilter.SPLIT_ON_NUMERICS;
            flags |= (stemPossessive == 1) ? WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE : 0;
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new MockTokenizer(new StringReader(input), MockTokenizer.KEYWORD, false), flags, null);

            AssertTokenStreamContents(wdf, output);
        }

示例#13

显示文件

文件： TestWordDelimiterFilter.cs 项目： ChristopherHaws/lucenenet

        public virtual void doSplit(string input, params string[] output)
        {
            int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE | WordDelimiterFilter.SPLIT_ON_NUMERICS | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE;
            WordDelimiterFilter wdf = new WordDelimiterFilter(TEST_VERSION_CURRENT, new MockTokenizer(new StringReader(input), MockTokenizer.KEYWORD, false), WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE, flags, null);

            AssertTokenStreamContents(wdf, output);
        }

示例#14

显示文件

文件： WordDelimiterFilter.cs 项目： ChristopherHaws/lucenenet

 public WordDelimiterConcatenation(WordDelimiterFilter outerInstance)
 {
     this.outerInstance = outerInstance;
 }

示例#15

显示文件

文件： WordDelimiterFilter.cs 项目： ChristopherHaws/lucenenet

 public OffsetSorter(WordDelimiterFilter outerInstance)
 {
     this.outerInstance = outerInstance;
 }

示例#16

显示文件

 public OffsetSorter(WordDelimiterFilter outerInstance)
 {
     this.outerInstance = outerInstance;
 }

示例#17

显示文件

 public WordDelimiterConcatenation(WordDelimiterFilter outerInstance)
 {
     this.outerInstance = outerInstance;
 }

示例#18

显示文件

文件： TestBugInSomething.cs 项目： ChristopherHaws/lucenenet

 public override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
 {
     Tokenizer tokenizer = new WikipediaTokenizer(reader);
     TokenStream stream = new SopTokenFilter(tokenizer);
     stream = new WordDelimiterFilter(TEST_VERSION_CURRENT, stream, table, -50, protWords);
     stream = new SopTokenFilter(stream);
     return new TokenStreamComponents(tokenizer, stream);
 }