You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by sp...@gmx.eu on 2009/01/02 16:08:46 UTC

Re-combining already indexed documents

Hi,

I have already indexed documents. I want to recombine them into new
documents. Is this possible without the original documents - only with the
index?

Example:

doc1, doc2, doc3 are indexed.
I want a new indexed doc4 which is indexed as if I had concatenated doc1,
doc2, doc3 into doc4 and then indexed doc4. 


Thank you


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Re-combining already indexed documents

Posted by sp...@gmx.eu.
> The fastest way to reconstruct the token 
> stream would  
> be to use the TermFreqVector but if you didn't store it at 
> index time  
> you would have traverse the inverted index using TermEnum and  
> TermPositions in order to pick up the term values and 
> positions. This  
> can be a rather time consuming process if you have a large index.

OK, then I better reindex from source. Thank you.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Re-combining already indexed documents

Posted by Karl Wettin <ka...@gmail.com>.
Hello,

the easiest way would be to construct the combined document using the  
data from your primary source rather than reconstructing it from the  
index. If the source data no longer is available you could still  
reconstruct a token stream. The data is however a bit spread out so it  
can turn out to be a bit tricky. Payloads can only be accessed via  
TermPositions and start/end offsets can only be accessed via  the  
TermFreqVector. The fastest way to reconstruct the token stream would  
be to use the TermFreqVector but if you didn't store it at index time  
you would have traverse the inverted index using TermEnum and  
TermPositions in order to pick up the term values and positions. This  
can be a rather time consuming process if you have a large index.

The code of TermVectorAccessor in contrib/miscellaneous might be  
helpful.


       karl

2 jan 2009 kl. 16.08 skrev <sp...@gmx.eu> <sp...@gmx.eu>:

> Hi,
>
> I have already indexed documents. I want to recombine them into new
> documents. Is this possible without the original documents - only  
> with the
> index?
>
> Example:
>
> doc1, doc2, doc3 are indexed.
> I want a new indexed doc4 which is indexed as if I had concatenated  
> doc1,
> doc2, doc3 into doc4 and then indexed doc4.
>
>
> Thank you
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org