Posted to java-user@lucene.apache.org by Chris Bamford <ch...@talktalk.net> on 2011/04/08 16:44:09 UTC

Rewriting an index without losing 'hidden' data

Hi, 

I recently discovered that I need to add a single field to every document in an existing (very large) index.  Reindexing from scratch is not an option I want to consider right now, so I wrote a utility to add the field by rewriting the index - but this seems to have lost some of the fields (those indexed but not stored?).  In fact, it shrank a 12 GB index down to 4.2 GB - clearly not what I wanted.  :-)
What am I doing wrong?

My technique was:

  Analyzer analyser = new StandardAnalyzer();
  IndexSearcher searcher = new IndexSearcher(indexPath);
  IndexWriter indexWriter = new IndexWriter(indexPath, analyser);
  Hits hits = matchAllDocumentsFromIndex(searcher);

  for (int i=0; i < hits.length(); i++) {
          Document doc = hits.doc(i);
          String id = doc.get("unique-id");
          doc.add(new Field("newField", newValue, Field.Store.YES, Field.Index.UN_TOKENIZED));
          indexWriter.updateDocument(new Term("unique-id", id), doc);
  }

  searcher.close();
  indexWriter.optimize(); 
  indexWriter.close();

Note that my matchAllDocumentsFromIndex() does get the right number of hits from the index - i.e. the same number as held in the index.


 Thanks for any ideas!
BTW I am using Lucene 2.3.2.

- Chris

Re: Rewriting an index without losing 'hidden' data

Posted by Samarendra Pratap <sa...@gmail.com>.
Hi, I know this answer comes late (sorry, Chris), but I thought it would
still be useful to share.
I was going through the list mails and realised that we did exactly this a
few months back.

*Objective: To add a new field to an existing index without re-writing the
whole index.*

We have an index (the "primary index") to which we want to add a new field,
say "tags". The source of the data is a database.

Here is the procedure in pseudo-code:

1. Create an index ("index 2", the secondary index) from the database with
   just two fields: "id" (which is also the unique identifier in the primary
   index) and "tags" (keep it stored).

2. Open a new IndexWriter ("index 3").

3. Loop over all the documents of the primary index in increasing order of
   doc-id, starting from zero. For each doc-id:
   - Get the document and read the value of its "id" field.
   - Search for this value in the "id" field of index 2 (or get the document
     directly through IndexReader and termVector). You should get only one
     document.
   - If a document is found, add it to index 3.
   - If no document is found, add a blank document to index 3 (to maintain
     the doc-id order).

   After the loop has finished, the doc-ids of the primary index and index 3
   will line up, i.e. the document at doc-id 5 in index 3 and the document at
   doc-id 5 in the primary index represent the same database record, just
   with different fields.

4. Open a *ParallelReader* (this is the key :-) ) and add both indexes (the
   primary index and index 3) one by one.

5. Open an IndexWriter and use addIndexes(IndexReader) to create a single
   index.

The final index will contain the primary index plus the "tags" field. :-)
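
The steps above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not Lucene code: a list position stands in for the doc-id, each document is a plain map of field names to values, and the names ParallelMergeSketch, alignToPrimary and mergeParallel are made up for this sketch. It shows why the blank placeholder documents matter: the merge pairs documents purely by position.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a "document" is a map of field name -> value, and an
// "index" is a list whose position plays the role of the Lucene doc-id.
class ParallelMergeSketch {

    // Build "index 3": one entry per primary document, in the same order.
    // A primary document with no match in "index 2" (modelled here as a
    // lookup table from id to tags) gets an empty placeholder so that
    // positions stay aligned.
    static List<Map<String, String>> alignToPrimary(
            List<Map<String, String>> primary,
            Map<String, String> tagsById) {
        List<Map<String, String>> aligned = new ArrayList<>();
        for (Map<String, String> doc : primary) {
            Map<String, String> extra = new LinkedHashMap<>();
            String tags = tagsById.get(doc.get("id"));
            if (tags != null) {
                extra.put("tags", tags);
            }
            aligned.add(extra);
        }
        return aligned;
    }

    // Stand-in for reading both indexes in parallel: the document at
    // position i is the union of the fields at position i in each index.
    static List<Map<String, String>> mergeParallel(
            List<Map<String, String>> primary,
            List<Map<String, String>> aligned) {
        if (primary.size() != aligned.size()) {
            throw new IllegalArgumentException(
                    "both indexes must hold the same number of documents");
        }
        List<Map<String, String>> merged = new ArrayList<>();
        for (int i = 0; i < primary.size(); i++) {
            Map<String, String> doc = new LinkedHashMap<>(primary.get(i));
            doc.putAll(aligned.get(i));
            merged.add(doc);
        }
        return merged;
    }
}
```

If a placeholder were skipped, every later entry would shift by one position and the "tags" values would end up attached to the wrong documents, which is the failure the blank documents in step 3 are meant to prevent.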


I would welcome comments from the list on any issues with this approach.


My question follows from this:
I tried this with a NumericField (as "tags"), but it didn't work.
My guess (excuse me for guessing without deeper investigation) is that this
is because NumericField is not a Field; it is an AbstractField.

Whether or not my guess is correct, can someone give me a hint or point me
to something that would help me make the same process work for NumericField
as well?

I look forward to hearing from the experts on the list.


On Fri, Apr 8, 2011 at 9:38 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Unfortunately, updateDocument replaces the *entire* previous document
> with the new one.
>
> The ability to update a single indexed field (either replace that
> field entirely, or, change only certain token occurrences within it),
> while leaving all other indexed fields in the document unaffected, has
> been a long requested big missing feature in Lucene.  We call it
> "incremental field updates".
>
> There have been some healthy discussions on the dev list, that have
> worked out a good rough design (eg see
> http://markmail.org/thread/lsfjhpiblzymkfcn).  Also, recent
> improvements in how buffered deletes are handled should make it alot
> easier for updates to "piggyback" using that same packet stream
> approach.  So... I think there is hope some day that we'll get this
> into Lucene.
>
> Mike
>
> http://blog.mikemccandless.com

-- 
Regards,
Samar

Re: Rewriting an index without losing 'hidden' data

Posted by Michael McCandless <lu...@mikemccandless.com>.
Unfortunately, updateDocument replaces the *entire* previous document
with the new one.
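
To illustrate the point with a toy model (plain Java collections, not the Lucene API; the class ReplaceWholeDocumentSketch and its methods are hypothetical): an update keyed on a unique id behaves like a delete followed by an add, so any field that is missing from the new document is gone afterwards.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: documents are field maps keyed by a unique id.
class ReplaceWholeDocumentSketch {
    private final Map<String, Map<String, String>> docsById = new HashMap<>();

    void addDocument(String id, Map<String, String> fields) {
        docsById.put(id, new HashMap<>(fields));
    }

    // Models the behaviour described above: remove whatever matches the id,
    // then add the new document. Nothing from the old document carries over.
    void updateDocument(String id, Map<String, String> newFields) {
        docsById.remove(id);
        addDocument(id, newFields);
    }

    Map<String, String> get(String id) {
        return docsById.get(id);
    }
}
```

In this model, updating a document that had a "body" field with a replacement that only carries "unique-id" and "newField" leaves no "body" behind.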

The ability to update a single indexed field (either replacing that
field entirely, or changing only certain token occurrences within it),
while leaving all other indexed fields in the document unaffected, has
been a long-requested, big missing feature in Lucene.  We call it
"incremental field updates".

There have been some healthy discussions on the dev list that have
worked out a good rough design (e.g. see
http://markmail.org/thread/lsfjhpiblzymkfcn).  Also, recent
improvements in how buffered deletes are handled should make it a lot
easier for updates to "piggyback" using that same packet stream
approach.  So... I think there is hope that some day we'll get this
into Lucene.

Mike

http://blog.mikemccandless.com

On Fri, Apr 8, 2011 at 11:00 AM, Ian Lea <ia...@gmail.com> wrote:
> Unfortunately you just can't do this.  Might be possible if all fields
> were stored but evidently they are not in your index.  For unstored
> fields, the Document object will not contain the data that was passed
> in when the doc was originally added.
>
> I believe there might be a way of recreating some of the missing data
> via TermFreqVector but that has always sounded dodgy and lossy to me.
>
> The safest way is to reindex, however painful it might be.  Maybe you
> could take the opportunity to upgrade lucene at the same time!
>
>
> --
> Ian.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Rewriting an index without losing 'hidden' data

Posted by Chris Bamford <ch...@talktalk.net>.
Thanks Ian, it's as I feared!
I shall have to reindex (and upgrade!).

Cheers,

- Chris

-----Original Message-----
From: Ian Lea <ia...@gmail.com>
To: java-user@lucene.apache.org
Sent: Fri, 8 Apr 2011 16:00
Subject: Re: Rewriting an index without losing 'hidden' data


Unfortunately you just can't do this.  Might be possible if all fields
were stored but evidently they are not in your index.  For unstored
fields, the Document object will not contain the data that was passed
in when the doc was originally added.

I believe there might be a way of recreating some of the missing data
via TermFreqVector but that has always sounded dodgy and lossy to me.

The safest way is to reindex, however painful it might be.  Maybe you
could take the opportunity to upgrade lucene at the same time!


--
Ian.



Re: Rewriting an index without losing 'hidden' data

Posted by Ian Lea <ia...@gmail.com>.
Unfortunately you just can't do this.  It might be possible if all fields
were stored, but evidently they are not in your index.  For unstored
fields, the Document object will not contain the data that was passed
in when the doc was originally added.
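
A rough sketch of why the retrieved Document comes back incomplete, using plain Java collections rather than Lucene itself (all names here are hypothetical, and the whitespace tokenizer is a deliberate simplification): indexed text goes into an inverted index for searching, but only fields marked as stored are kept in a form that can be handed back.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only, not the Lucene API.
class StoredVersusIndexedSketch {
    // Per-document stored values: all that a retrieved document contains.
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();
    // Inverted index: "field:token" -> ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Every field is made searchable; its text is kept only if store is true.
    void index(int docId, String field, String text, boolean store) {
        for (String token : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(field + ":" + token, k -> new HashSet<>())
                    .add(docId);
        }
        stored.computeIfAbsent(docId, k -> new HashMap<>());
        if (store) {
            stored.get(docId).put(field, text);
        }
    }

    // Searching works for stored and unstored fields alike.
    boolean matches(int docId, String field, String token) {
        return postings.getOrDefault(field + ":" + token, new HashSet<>())
                .contains(docId);
    }

    // Retrieval can only return what was stored.
    Map<String, String> retrieve(int docId) {
        return new HashMap<>(stored.getOrDefault(docId, new HashMap<>()));
    }
}
```

Re-adding the result of retrieve() would therefore reindex only the stored fields, which is consistent with the index shrinking in the original post.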

I believe there might be a way of recreating some of the missing data
via TermFreqVector, but that has always sounded dodgy and lossy to me.

The safest way is to reindex, however painful it might be.  Maybe you
could take the opportunity to upgrade Lucene at the same time!


--
Ian.

