Posted to solr-user@lucene.apache.org by Robert Stewart <bs...@gmail.com> on 2012/02/13 23:29:21 UTC

Can I rebuild an index and remove some fields?

Let's say I have a large index (100M docs, 1TB, split across 10 indexes), and a bunch of the "stored" and "indexed" fields are not used in search at all.  In order to save memory and disk, I'd like to rebuild that index *without* those fields, but I don't have the original documents to rebuild the entire index from (I don't have the full text anymore, etc.).  Is there some way to rebuild or optimize an existing index with only a subset of the existing indexed fields?  Or, alternatively, is there a way to avoid loading some indexed fields at all (to avoid loading the term infos and terms index)?

Thanks
Bob

Re: Can I rebuild an index and remove some fields?

Posted by Robert Stewart <bs...@gmail.com>.
I will test it with my big production indexes first; if it works, I
think I will port it to Java and add it to contrib.


Re: Can I rebuild an index and remove some fields?

Posted by Li Li <fa...@gmail.com>.
Great. I think you could make it a public tool; others may also need
this functionality.


Re: Can I rebuild an index and remove some fields?

Posted by Robert Stewart <bs...@gmail.com>.
I implemented an index shrinker and it works.  I reduced my test index
from 6.6 GB to 3.6 GB by removing a single shingled field I did not
need anymore.  I'm actually using Lucene.Net for this project, so the
code is C# against the Lucene.Net 2.9.2 API, but the basic idea is:

Create an IndexReader wrapper that only enumerates the terms you want
to keep, and that removes the unwanted terms from documents when
returning them.

Use the SegmentMerger to re-write each segment (where each segment is
wrapped by the wrapper class), writing each new segment to a new
directory.  Collect the SegmentInfos and do a commit in order to
create a new segments file in the new index directory.

Done - you now have a shrunk index with the specified terms removed.

The implementation uses a separate thread for each segment, so it
re-writes them in parallel.  It took about 15 minutes for a
770,000-doc index on my MacBook.
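
A minimal sketch of such a wrapper in Java against the Lucene 2.9-era
API (my illustration, not the poster's actual C# code; a complete
version would also have to filter terms(Term), termDocs(),
termPositions(), getFieldNames(), and stored-field access the same
way):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

// Hypothetical wrapper: skips every term that belongs to a dropped
// field, so SegmentMerger never copies those postings into the new
// segment.
public class FieldStrippingReader extends FilterIndexReader {
  private final Set<String> droppedFields;

  public FieldStrippingReader(IndexReader in, Set<String> droppedFields) {
    super(in);
    this.droppedFields = droppedFields;
  }

  @Override
  public TermEnum terms() throws IOException {
    return new FilterTermEnum(in.terms()) {
      @Override
      public boolean next() throws IOException {
        // Advance past any term whose field is being removed.
        while (in.next()) {
          if (!droppedFields.contains(in.term().field())) {
            return true;
          }
        }
        return false;
      }
    };
  }
}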


Re: Can I rebuild an index and remove some fields?

Posted by Li Li <fa...@gmail.com>.
I have roughly read the code of the 4.0 trunk; maybe it's feasible.
    SegmentMerger.add(IndexReader) will add the readers to be merged.
    merge() will call:
      mergeTerms(segmentWriteState);
      mergePerDoc(segmentWriteState);

   mergeTerms() will construct the fields from the IndexReaders:

    for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
      final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
      final Fields f = r.reader.fields();
      final int maxDoc = r.reader.maxDoc();
      if (f != null) {
        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
        fields.add(f);
      }
      docBase += maxDoc;
    }

    So if you wrap your IndexReader and override its fields() method,
maybe it will work for merging terms.

    For DocValues, the wrapper can also override
AtomicReader.docValues() and just return null for the fields you want
to remove. It should probably traverse a CompositeReader's
getSequentialSubReaders() and wrap each AtomicReader.

    Other things, like term vectors and norms, are similar.
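
A rough sketch of that wrapper in Java, following the
FilterAtomicReader pattern of the 4.0 trunk (class and method names
are as described above and may differ by trunk revision; a complete
implementation would also filter the field-name enumeration that
Fields exposes):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.Terms;

// Hypothetical field-dropping wrapper for the 4.0-trunk reader API.
public class FieldDroppingReader extends FilterAtomicReader {
  private final Set<String> droppedFields;

  public FieldDroppingReader(AtomicReader in, Set<String> droppedFields) {
    super(in);
    this.droppedFields = droppedFields;
  }

  @Override
  public Fields fields() throws IOException {
    final Fields f = super.fields();
    if (f == null) {
      return null;
    }
    return new FilterFields(f) {
      @Override
      public Terms terms(String field) throws IOException {
        // Pretend a dropped field has no terms, so the merge skips it.
        return droppedFields.contains(field) ? null : super.terms(field);
      }
      // Note: the field-name enumeration should skip dropped fields
      // too; omitted here for brevity.
    };
  }

  @Override
  public DocValues docValues(String field) throws IOException {
    // Pretend a dropped field has no doc values.
    return droppedFields.contains(field) ? null : super.docValues(field);
  }
}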

Re: Can I rebuild an index and remove some fields?

Posted by Robert Stewart <bs...@gmail.com>.
I was thinking that if I make a wrapper class that aggregates another
IndexReader and filters out the terms I don't want anymore, it might
work, and I could then pass that wrapper into SegmentMerger.  I think
if I filter out terms on GetFieldNames(...) and Terms(...) it might
work.

Something like:

HashSet<string> ignoredTerms = ...;
FilteringIndexReader wrapper = new FilteringIndexReader(reader, ignoredTerms);
SegmentMerger merger = new SegmentMerger(writer);
merger.add(wrapper);
merger.Merge();

Re: Can I rebuild an index and remove some fields?

Posted by Li Li <fa...@gmail.com>.
For method 2, delete-by-query is wrong - we can't delete terms that way.
   You would also need to hack the .tii and .tis (term dictionary) files.


Re: Can I rebuild an index and remove some fields?

Posted by Li Li <fa...@gmail.com>.
Method 1: dumping data.
For stored fields, you can traverse the whole index and save them
somewhere else.
For indexed-but-not-stored fields, it may be more difficult:
    if an indexed-but-not-stored field is not analyzed (fields such as
id), it's easy to get back from FieldCache.StringIndex;
    but for analyzed fields, though they can theoretically be restored
from term vectors and term positions, they are hard to recover from
the index.

Method 2: hacking the index metadata.
1. Indexed fields:
      delete by query, e.g. field:*
2. Stored fields:
       because all the fields of a document are stored sequentially,
it's not easy to delete some of them.  Leaving them in place will not
affect search speed, but if you retrieve stored fields and the useless
fields are very long, it will slow retrieval down.
       It's also possible to hack this, but it takes more effort to
understand the index file format and traverse the fdt/fdx files.
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html

This will give you some insight.
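
For the stored-field traversal in method 1, a minimal sketch against
the Lucene 3.x API (the "id" field and the println are placeholders; a
real tool would feed each document into whatever dump or re-indexing
pipeline you use):

import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Walk every live document and hand its stored fields to a consumer.
public class StoredFieldDumper {
  public static void dump(String indexPath) throws IOException {
    Directory dir = FSDirectory.open(new File(indexPath));
    IndexReader reader = IndexReader.open(dir); // read-only by default
    try {
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) {
          continue; // skip deleted docs
        }
        Document doc = reader.document(i);
        // Save whichever stored fields you want to keep.
        System.out.println(doc.get("id"));
      }
    } finally {
      reader.close();
    }
  }
}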
