FST-based term dictionary that uses the term ordinal (ord) as the FST output. The FST holds the <term, ord> mapping, and each term's metadata is delta-encoded into a single byte block. Typically the byte block consists of four parts:

  1. term statistics: docFreq, totalTermFreq;
  2. monotonic long[], e.g. the pointer to the postings list for that term;
  3. generic byte[], e.g. other information customized by the postings base;
  4. a single-level skip list to speed up metadata decoding by ord.
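
The delta encoding of the monotonic long[] part can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not Lucene.NET types; the sketch keeps the deltas in memory and leaves out how the writer serializes them into the byte block.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not a Lucene.NET type: delta-encodes per-term
// monotonic metadata such as postings file pointers.
public static class MonotonicDeltaSketch
{
    // Stores each term's longs as the difference from the previous term's longs.
    public static List<long[]> Encode(IReadOnlyList<long[]> perTermLongs, int longsSize)
    {
        var deltas = new List<long[]>(perTermLongs.Count);
        var last = new long[longsSize]; // previous term's values, initially all zero
        foreach (long[] longs in perTermLongs)
        {
            var delta = new long[longsSize];
            for (int i = 0; i < longsSize; i++)
            {
                delta[i] = longs[i] - last[i];
                if (delta[i] < 0)
                    throw new ArgumentException("values must be non-decreasing in each slot");
                last[i] = longs[i];
            }
            deltas.Add(delta);
        }
        return deltas;
    }

    // Rebuilds the absolute values by summing the deltas in term order.
    public static List<long[]> Decode(IReadOnlyList<long[]> deltas, int longsSize)
    {
        var result = new List<long[]>(deltas.Count);
        var running = new long[longsSize];
        foreach (long[] delta in deltas)
        {
            for (int i = 0; i < longsSize; i++)
                running[i] += delta[i];
            result.Add((long[])running.Clone());
        }
        return result;
    }
}
```

Because the values only grow from term to term, the deltas stay small and compress well under a variable-length integer encoding.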

Files:

Term Index

The .tix contains a list of FSTs, one for each field (written TermFST^NumFields below, i.e. TermFST repeated NumFields times). Each FST maps a term to its ordinal (ord) within the current field.

  • TermIndex(.tix) --> Header, TermFST^NumFields, Footer
  • TermFST --> FST
  • Header --> CodecUtil#writeHeader CodecHeader
  • Footer --> CodecUtil#writeFooter CodecFooter

Notes:

  • Since terms are already sorted before they are written to the Term Block, their ords can be used directly to seek term metadata in the term block.
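
To make the note above concrete, here is a small sketch in which a sorted array with binary search stands in for the FST. This is a deliberate substitution for illustration only; the type and method names are hypothetical, and ordinal string comparison is assumed as a simplification of the term order used by the real index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the term index: binary search over a sorted
// array plays the role of the FST and returns the term's ord.
public static class OrdLookupSketch
{
    // Returns the ord (rank in sorted order) of 'term', or -1 if it is absent.
    public static int OrdOf(string[] sortedTerms, string term)
    {
        int idx = Array.BinarySearch(sortedTerms, term, StringComparer.Ordinal);
        return idx >= 0 ? idx : -1;
    }

    // Metadata is stored in the same sorted order, so the ord is the index.
    public static int DocFreqOf(string[] sortedTerms, int[] docFreqByOrd, string term)
    {
        int ord = OrdOf(sortedTerms, term);
        if (ord < 0)
            throw new KeyNotFoundException(term);
        return docFreqByOrd[ord];
    }
}
```

The point of the design is that no per-term pointer needs to be stored in the index: the ord alone identifies the term's slot in the term block.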

Term Block

The .tbk contains all the statistics and metadata for terms, along with a field summary (per-field data such as the number of documents in the current field). For each field, there are four blocks:

  • statistics bytes block: contains term statistics;
  • metadata longs block: delta-encodes the monotonic part of the metadata;
  • metadata bytes block: encodes the other parts of the metadata;
  • skip block: contains skip data, to speed up metadata seeking and decoding.
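
How a skip block speeds up decoding of delta-encoded metadata can be sketched as follows. This is a simplified in-memory model under stated assumptions: a single long per term, a caller-chosen interval, and hypothetical names throughout; it does not reproduce the on-disk layout.

```csharp
using System;

// Hypothetical model of a single-level skip list over delta-encoded values.
public sealed class SkipSketch
{
    private readonly long[] _deltas;   // per-term difference from the previous term
    private readonly long[] _skipBase; // absolute value preceding each interval's first term
    private readonly int _interval;

    public SkipSketch(long[] absoluteValues, int interval)
    {
        if (interval <= 0)
            throw new ArgumentOutOfRangeException(nameof(interval));
        _interval = interval;
        _deltas = new long[absoluteValues.Length];
        _skipBase = new long[(absoluteValues.Length + interval - 1) / interval];
        long prev = 0;
        for (int ord = 0; ord < absoluteValues.Length; ord++)
        {
            if (ord % interval == 0)
                _skipBase[ord / interval] = prev; // record a skip point
            _deltas[ord] = absoluteValues[ord] - prev;
            prev = absoluteValues[ord];
        }
    }

    // Starts at the nearest skip point and sums at most 'interval' deltas.
    public long ValueAt(int ord)
    {
        int block = ord / _interval;
        long value = _skipBase[block];
        for (int i = block * _interval; i <= ord; i++)
            value += _deltas[i];
        return value;
    }
}
```

Without the skip entries, recovering the value for a given ord would require summing every delta from the first term onward.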

File Format (a trailing ^N means the preceding item or <...> group is repeated N times):

  • TermBlock(.tbk) --> Header, PostingsHeader, FieldSummary, DirOffset
  • FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, DataBlock>^NumFields, Footer
  • DataBlock --> StatsBlockLength, MetaLongsBlockLength, MetaBytesBlockLength, SkipBlock, StatsBlock, MetaLongsBlock, MetaBytesBlock
  • SkipBlock --> <StatsFPDelta, MetaLongsSkipFPDelta, MetaBytesSkipFPDelta, MetaLongsSkipDelta^LongsSize>^NumTerms
  • StatsBlock --> <DocFreq[Same?], (TotalTermFreq-DocFreq)?>^NumTerms
  • MetaLongsBlock --> <LongDelta^LongsSize, BytesSize>^NumTerms
  • MetaBytesBlock --> Byte^MetaBytesBlockLength
  • Header --> CodecUtil#writeHeader CodecHeader
  • DirOffset --> DataOutput#writeLong Uint64
  • NumFields, FieldNumber, DocCount, DocFreq, LongsSize --> DataOutput#writeVInt VInt
  • NumTerms, SumTotalTermFreq, SumDocFreq, StatsBlockLength, MetaLongsBlockLength, MetaBytesBlockLength, StatsFPDelta, MetaLongsSkipFPDelta, MetaBytesSkipFPDelta, MetaLongsSkipStart, TotalTermFreq, LongDelta --> DataOutput#writeVLong VLong
  • Footer --> CodecUtil#writeFooter CodecFooter
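
Most fields above are written as VInt or VLong values. As a rough sketch of that variable-length encoding (seven payload bits per byte, low-order bits first, high bit set when more bytes follow), the hypothetical helper below encodes and decodes a non-negative long. It is not the Lucene.NET DataOutput implementation and works on a plain byte array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating a VLong-style variable-length encoding.
public static class VLongSketch
{
    // Emits 7 bits per byte, lowest bits first; the high bit marks continuation.
    public static byte[] Write(long value)
    {
        if (value < 0)
            throw new ArgumentOutOfRangeException(nameof(value), "only non-negative values are handled here");
        var bytes = new List<byte>();
        ulong v = (ulong)value;
        while ((v & ~0x7FUL) != 0)
        {
            bytes.Add((byte)((v & 0x7F) | 0x80));
            v >>= 7;
        }
        bytes.Add((byte)v);
        return bytes.ToArray();
    }

    // Reads bytes until one without the continuation bit is found.
    public static long Read(byte[] bytes)
    {
        long value = 0;
        int shift = 0;
        foreach (byte b in bytes)
        {
            value |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                break;
            shift += 7;
        }
        return value;
    }
}
```

For example, 300 is encoded as the two bytes 0xAC 0x02, while any value below 128 takes a single byte. This is why small deltas are cheap to store.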

Notes:

  • The formats of PostingsHeader and MetaBytes are customized by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (non-monotonic values such as pulsed postings data).
  • During initialization the reader loads all the blocks into memory. The SkipBlock is decoded up front so that, during a seek, the term dictionary can look up file pointers directly. StatsFPDelta, MetaLongsSkipFPDelta, etc. give the file offsets for every SkipInterval-th term. MetaLongsSkipDelta is the difference from the previous skip entry; it gives the value of the metadata longs preceding every SkipInterval-th term.
  • DocFreq is the number of documents that contain the term. TotalTermFreq is the total number of occurrences of the term. These two values are usually the same for long-tail terms, so one bit is stolen from DocFreq to flag this case, which allows the encoding of TotalTermFreq to be omitted.
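
The bit-stealing trick for DocFreq and TotalTermFreq can be sketched as below. The sketch assumes the flag lives in the lowest bit of the shifted DocFreq and that a set bit means the two values are equal; the names are hypothetical, and the values are returned as plain longs instead of being written as VInt/VLong.

```csharp
using System;

// Hypothetical helper illustrating the StatsBlock entry layout
// DocFreq[Same?], (TotalTermFreq-DocFreq)?
public static class StatsEncodingSketch
{
    // Returns one value when totalTermFreq == docFreq, two values otherwise.
    public static long[] Encode(int docFreq, long totalTermFreq)
    {
        long delta = totalTermFreq - docFreq;
        if (delta < 0)
            throw new ArgumentException("totalTermFreq must be at least docFreq");
        if (delta == 0)
            return new long[] { ((long)docFreq << 1) | 1 }; // low bit set: values are equal
        return new long[] { (long)docFreq << 1, delta };     // low bit clear: delta follows
    }

    // Reverses Encode, reading the delta only when the flag bit is clear.
    public static (int DocFreq, long TotalTermFreq) Decode(long[] encoded)
    {
        int docFreq = (int)(encoded[0] >> 1);
        bool same = (encoded[0] & 1) == 1;
        long totalTermFreq = same ? docFreq : docFreq + encoded[1];
        return (docFreq, totalTermFreq);
    }
}
```

For a term that occurs once in each of three documents, Encode(3, 3) yields the single value 7, so nothing is spent on TotalTermFreq.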
@lucene.experimental
Inheritance: FieldsConsumer
Example #1
            internal TermsWriter(FSTOrdTermsWriter outerInstance, FieldInfo fieldInfo)
            {
                _outerInstance = outerInstance;
                _numTerms      = 0;
                _fieldInfo     = fieldInfo;
                // SetField returns how many longs the postings writer emits per term.
                _longsSize     = outerInstance.postingsWriter.SetField(fieldInfo);
                // The FST maps each term (as a byte sequence) to its ord.
                _outputs       = PositiveInt32Outputs.Singleton;
                _builder       = new Builder<Int64>(FST.INPUT_TYPE.BYTE1, _outputs);

                // State recorded at the most recent skip point.
                _lastBlockStatsFp     = 0;
                _lastBlockMetaLongsFp = 0;
                _lastBlockMetaBytesFp = 0;
                _lastBlockLongs       = new long[_longsSize];

                // State as of the previously written term.
                _lastLongs       = new long[_longsSize];
                _lastMetaBytesFp = 0;
            }
Example #2
        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            PostingsWriterBase postingsWriter = new Lucene41PostingsWriter(state);

            // Tracks whether the terms writer was constructed successfully.
            bool success = false;
            try
            {
                FieldsConsumer ret = new FSTOrdTermsWriter(state, postingsWriter);
                success = true;
                return ret;
            }
            finally
            {
                // On failure, close the postings writer so it is not leaked.
                if (!success)
                {
                    IOUtils.CloseWhileHandlingException(postingsWriter);
                }
            }
        }
Example #3
        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            PostingsWriterBase postingsWriter = new Lucene41PostingsWriter(state);

            bool success = false;

            try
            {
                FieldsConsumer ret = new FSTOrdTermsWriter(state, postingsWriter);
                success = true;
                return ret;
            }
            finally
            {
                if (!success)
                {
                    IOUtils.CloseWhileHandlingException(postingsWriter);
                }
            }
        }
Example #4
        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            PostingsWriterBase docsWriter = null;
            PostingsWriterBase pulsingWriter = null;

            bool success = false;
            try
            {
                docsWriter = _wrappedPostingsBaseFormat.PostingsWriterBase(state);
                pulsingWriter = new PulsingPostingsWriter(state, _freqCutoff, docsWriter);
                FieldsConsumer ret = new FSTOrdTermsWriter(state, pulsingWriter);
                success = true;
                return ret;
            }
            finally
            {
                if (!success)
                {
                    IOUtils.CloseWhileHandlingException(docsWriter, pulsingWriter);
                }
            }
        }
Example #5
        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            PostingsWriterBase docsWriter    = null;
            PostingsWriterBase pulsingWriter = null;

            bool success = false;

            try
            {
                docsWriter    = _wrappedPostingsBaseFormat.PostingsWriterBase(state);
                pulsingWriter = new PulsingPostingsWriter(state, _freqCutoff, docsWriter);
                FieldsConsumer ret = new FSTOrdTermsWriter(state, pulsingWriter);
                success = true;
                return ret;
            }
            finally
            {
                if (!success)
                {
                    IOUtils.CloseWhileHandlingException(docsWriter, pulsingWriter);
                }
            }
        }
Example #6
            internal TermsWriter(FSTOrdTermsWriter outerInstance, FieldInfo fieldInfo)
            {
                _outerInstance = outerInstance;
                _numTerms = 0;
                _fieldInfo = fieldInfo;
                _longsSize = outerInstance.postingsWriter.SetField(fieldInfo);
                _outputs = PositiveIntOutputs.Singleton;
                _builder = new Builder<long>(FST.INPUT_TYPE.BYTE1, _outputs);

                _lastBlockStatsFp = 0;
                _lastBlockMetaLongsFp = 0;
                _lastBlockMetaBytesFp = 0;
                _lastBlockLongs = new long[_longsSize];

                _lastLongs = new long[_longsSize];
                _lastMetaBytesFp = 0;
            }