FST-based term dictionary that uses term metadata as the FST output. The FST directly holds the mapping from each term to its metadata. Term metadata consists of three parts: 1. term statistics: docFreq, totalTermFreq; 2. a monotonic long[], e.g. the pointer to the postings list for that term; 3. a generic byte[], e.g. other information needed by the postings reader.

File:

Term Dictionary

The .tst file contains a list of FSTs, one for each field. Each FST maps a term to its statistics (e.g. docFreq) and metadata (e.g. information for the postings list reader, such as the file pointer to the postings list).

Typically the metadata is separated into two parts:

  • Monotonic long array: some metadata always ascends in term order. The FST uses this part to share outputs between arcs.
  • Generic byte array: Used to store non-monotonic metadata.
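
To make the output-sharing idea concrete, here is a minimal sketch of how monotonic long metadata could be split into a shared part and a per-term remainder. The class `MonotonicSharing` and its methods are hypothetical names for illustration; they are not part of Lucene.NET, and the element-wise minimum used here is a simplification of what the library's FST outputs actually do.

```csharp
using System;

// Hypothetical helper (not a Lucene.NET type): splits monotonic long metadata
// into a part shared by two terms and a remainder specific to one of them.
public static class MonotonicSharing
{
    // Element-wise minimum: the portion that both outputs have in common
    // and that could therefore sit on a shared arc.
    public static long[] Common(long[] a, long[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("arrays must have the same length");
        var result = new long[a.Length];
        for (int i = 0; i < a.Length; i++) result[i] = Math.Min(a[i], b[i]);
        return result;
    }

    // What is left of 'value' after 'prefix' has been moved to a shared arc.
    public static long[] Subtract(long[] value, long[] prefix)
    {
        var result = new long[value.Length];
        for (int i = 0; i < value.Length; i++) result[i] = value[i] - prefix[i];
        return result;
    }

    // Accumulates outputs while walking arcs, restoring the original value.
    public static long[] Add(long[] prefix, long[] rest)
    {
        var result = new long[prefix.Length];
        for (int i = 0; i < prefix.Length; i++) result[i] = prefix[i] + rest[i];
        return result;
    }
}
```

For example, if one term has the file pointer { 100 } and a later term has { 250 }, `Common` yields { 100 }, the later term keeps only the remainder { 150 }, and `Add` restores { 250 } on lookup. Small remainders also encode compactly as variable-length integers.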

File format:
  • TermsDict(.tst) --> Header, PostingsHeader, FieldSummary, DirOffset
  • FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, TermFST>^NumFields
  • TermFST --> FST<TermData>
  • TermData --> Flag, BytesSize?, LongDelta^LongsSize?, Byte^BytesSize?, <DocFreq[Same?], (TotalTermFreq-DocFreq)>?
  • Header --> CodecHeader (CodecUtil#writeHeader)
  • DirOffset --> Uint64 (DataOutput#writeLong)
  • DocFreq, LongsSize, BytesSize, NumFields, FieldNumber, DocCount --> VInt (DataOutput#writeVInt)
  • TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq, LongDelta --> VLong (DataOutput#writeVLong)
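
Most fields in this format are VInt or VLong values. The following sketch shows the variable-length scheme for non-negative values: seven payload bits per byte, low-order group first, with the high bit marking that more bytes follow. `VarIntSketch` is a hypothetical stand-alone class for illustration; real code would call the DataOutput and DataInput methods instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-alone illustration of VInt/VLong encoding for
// non-negative values; not a Lucene.NET type.
public static class VarIntSketch
{
    // Emits 7 bits per byte, least-significant group first.
    public static byte[] Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        var bytes = new List<byte>();
        ulong v = (ulong)value;
        while (v >= 0x80)                          // more than 7 bits remain
        {
            bytes.Add((byte)((v & 0x7F) | 0x80));  // set the continuation bit
            v >>= 7;
        }
        bytes.Add((byte)v);                        // final byte, high bit clear
        return bytes.ToArray();
    }

    // Reads bytes until one without the continuation bit is found.
    public static long Decode(byte[] data)
    {
        long result = 0;
        int shift = 0;
        foreach (byte b in data)
        {
            result |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
        throw new FormatException("truncated variable-length value");
    }
}
```

With this scheme, values below 128 take a single byte, which is why storing small deltas rather than absolute values keeps the term dictionary compact.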

Notes:

  • The formats of PostingsHeader and of the generic meta bytes are defined by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (non-monotonic data such as pulsed postings).
  • The layout of TermData is determined by the FST: typically, monotonic metadata is dense around shallow arcs, while deeper arcs carry only generic bytes and term statistics.
  • The Flag byte indicates which parts of the metadata exist on the current arc. In particular, the monotonic part is omitted when it is an array of 0s.
  • Since LongsSize is per-field fixed, it is only written once in field summary.
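
The notes above can be sketched as a flag computation plus a compact statistics encoding. The bit assignments, the rule that a zero docFreq means "no statistics on this arc", and the class name `TermDataFlagSketch` are all assumptions made for illustration; the format description only states that a Flag byte exists and that DocFreq carries a Same? marker.

```csharp
using System;

// Hypothetical illustration of the TermData Flag byte; the bit positions
// below are assumed, not taken from the format description.
public static class TermDataFlagSketch
{
    public const int HasLongs = 1 << 0;  // monotonic long[] present
    public const int HasBytes = 1 << 1;  // generic byte[] present
    public const int HasStats = 1 << 2;  // docFreq / totalTermFreq present

    public static int ComputeFlag(long[] longs, byte[] bytes, int docFreq)
    {
        int flag = 0;
        // The monotonic part is omitted when it is an array of 0s.
        if (longs != null && Array.Exists(longs, v => v != 0)) flag |= HasLongs;
        if (bytes != null && bytes.Length > 0) flag |= HasBytes;
        // Assumption: a zero docFreq means no statistics live on this arc.
        if (docFreq != 0) flag |= HasStats;
        return flag;
    }

    // One plausible reading of "DocFreq[Same?], (TotalTermFreq-DocFreq)":
    // the low bit marks totalTermFreq == docFreq, so the difference is
    // written only when the two values differ.
    public static long[] EncodeStats(int docFreq, long totalTermFreq)
    {
        if (docFreq == totalTermFreq)
            return new long[] { ((long)docFreq << 1) | 1 };
        return new long[] { (long)docFreq << 1, totalTermFreq - docFreq };
    }
}
```

Under these assumptions, a term that occurs exactly once in each of three documents is stored as the single value 7, while a term with docFreq 3 and totalTermFreq 10 is stored as 6 followed by 7.
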
@lucene.experimental
Inheritance: FieldsConsumer
Example #1
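 internal TermsWriter(FSTTermsWriter outerInstance, FieldInfo fieldInfo)
 {
     _outerInstance = outerInstance;
     _numTerms      = 0;
     _fieldInfo     = fieldInfo;
     // The postings writer reports how many monotonic longs it stores per term for this field.
     _longsSize     = outerInstance._postingsWriter.SetField(fieldInfo);
     // The outputs object defines how TermData is combined and serialized on FST arcs.
     _outputs       = new FSTTermOutputs(fieldInfo, _longsSize);
     // Terms are fed to the FST builder as byte sequences (BYTE1 input type).
     _builder       = new Builder<FSTTermOutputs.TermData>(FST.INPUT_TYPE.BYTE1, _outputs);
 }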
Example #2
        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            PostingsWriterBase postingsWriter = new Lucene41PostingsWriter(state);

            bool success = false;
            try
            {
                FieldsConsumer ret = new FSTTermsWriter(state, postingsWriter);
                success = true;
                return ret;
            }
            finally
            {
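                if (!success)
                {
                    // Construction failed: close the postings writer and suppress any
                    // secondary exception so that the original error propagates.
                    IOUtils.CloseWhileHandlingException(postingsWriter);
                }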
            }
        }
Example #3
        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            PostingsWriterBase postingsWriter = new Lucene41PostingsWriter(state);

            bool success = false;

            try
            {
                FieldsConsumer ret = new FSTTermsWriter(state, postingsWriter);
                success = true;
                return ret;
            }
            finally
            {
                if (!success)
                {
                    IOUtils.DisposeWhileHandlingException(postingsWriter);
                }
            }
        }
Example #4
        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            PostingsWriterBase docsWriter = null;
            PostingsWriterBase pulsingWriter = null;

            bool success = false;
            try
            {
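                docsWriter = _wrappedPostingsBaseFormat.PostingsWriterBase(state);
                // Wrap the base writer in a pulsing writer, which inlines ("pulses") the postings
                // of low-frequency terms, governed by _freqCutoff, into the term dictionary.
                pulsingWriter = new PulsingPostingsWriter(state, _freqCutoff, docsWriter);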
                FieldsConsumer ret = new FSTTermsWriter(state, pulsingWriter);
                success = true;
                return ret;
            }
            finally
            {
                if (!success)
                {
                    IOUtils.CloseWhileHandlingException(docsWriter, pulsingWriter);
                }
            }
        }
Example #5
        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            PostingsWriterBase docsWriter    = null;
            PostingsWriterBase pulsingWriter = null;

            bool success = false;

            try
            {
                docsWriter    = _wrappedPostingsBaseFormat.PostingsWriterBase(state);
                pulsingWriter = new PulsingPostingsWriter(state, _freqCutoff, docsWriter);
                FieldsConsumer ret = new FSTTermsWriter(state, pulsingWriter);
                success = true;
                return ret;
            }
            finally
            {
                if (!success)
                {
                    IOUtils.DisposeWhileHandlingException(docsWriter, pulsingWriter);
                }
            }
        }