Posted to java-user@lucene.apache.org by Stephen Green <ee...@gmail.com> on 2014/03/11 19:33:20 UTC

Indexing a document that modifies itself as it's being indexed

I'm working on a system that uses Lucene 4.6.0, and I have a couple of use
cases for documents that modify themselves as they're being indexed.

For example, we have text classifiers that we would like to run on the
contents of certain fields.  These classifiers produce field values (i.e.,
the classes that the document is in) that I would like to be part of the
document.

Now, the text classifiers need to tokenize the text in order to do the
classification, and I'd like to avoid tokenizing the text multiple times,
so I can build a token filter that collects the tokens and then runs the
classifier.  This filter can know about the
org.apache.lucene.document.Document that's being processed, but I suspected
that adding elements to Document.fields while it's being indexed would lead
to a ConcurrentModificationException.
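Such a collecting filter might look like the following (a minimal sketch, not the actual code; the class and method names are mine):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Passes tokens through unchanged while recording each term, so a
 * classifier can see the full token list without a second tokenization.
 */
public final class CollectingTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final List<String> collected = new ArrayList<String>();

  public CollectingTokenFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      collected.add(termAtt.toString()); // remember the term for the classifier
      return true;                       // and pass the token through unchanged
    }
    return false; // stream exhausted: collected now holds every term
  }

  public List<String> getCollected() {
    return collected;
  }
}
```

The classifier would then be invoked once the stream is exhausted, which is exactly where the trouble described below begins.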

Since IndexWriter.addDocument takes an Iterable<IndexableField>, I figured
I could just make my own document class that implemented
Iterable<IndexableField> but allowed me to add new fields onto the end of
the document and extend the iteration to cover them.
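The appendable document class was along these lines (a reconstruction, not the actual code; the class name is hypothetical). The index-based iterator deliberately avoids ConcurrentModificationException, so fields appended mid-iteration are still returned:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.index.IndexableField;

/** A field list whose iterator tolerates fields being appended mid-iteration. */
public class AppendableDocument implements Iterable<IndexableField> {
  private final List<IndexableField> fields = new ArrayList<IndexableField>();

  public void add(IndexableField field) {
    fields.add(field);
  }

  @Override
  public Iterator<IndexableField> iterator() {
    // Index-based iterator: it re-checks fields.size() on every step, so
    // fields appended while iteration is in progress are simply included
    // rather than triggering ConcurrentModificationException.
    return new Iterator<IndexableField>() {
      private int next = 0;
      @Override public boolean hasNext() { return next < fields.size(); }
      @Override public IndexableField next() { return fields.get(next++); }
      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
  }
}
```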

I did this, but it didn't have the effect that I was hoping for, because
the fields that were added were never processed.

Working through the code, I discovered that
DocFieldProcessor.processDocument iterates through all the fields in the
document, collecting them by field name (using its own hash table?) before
processing them.

Of course, this breaks my add-fields-as-other-fields-are-being-processed
approach, because the iterator is exhausted before any of the processing
happens.

So, my questions are: Does it make any sense to try to do this?  If so, is
there an approach that will work without having to rewrite a lot of
indexing code?

Thanks,

Steve Green
-- 
Stephen Green

Re: Indexing a document that modifies itself as it's being indexed

Posted by Stephen Green <ee...@gmail.com>.
Thanks, Mike.

Once I was that deep in the guts of the indexer, I knew things were
probably not going to go my way.

I'll check out CachingTokenFilter.



On Tue, Mar 11, 2014 at 3:09 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> You can't rely on how IndexWriter will iterate/consume those fields;
> that's an implementation detail.
>
> Maybe you could use CachingTokenFilter to pre-process the text fields
> and append the new fields?  And then during indexing, replay the
> cached tokens, so you don't have to tokenize twice.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>


-- 
Stephen Green
http://thesearchguy.wordpress.com

Re: Indexing a document that modifies itself as it's being indexed

Posted by Michael McCandless <lu...@mikemccandless.com>.
You can't rely on how IndexWriter will iterate/consume those fields;
that's an implementation detail.

Maybe you could use CachingTokenFilter to pre-process the text fields
and append the new fields?  And then during indexing, replay the
cached tokens, so you don't have to tokenize twice.
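In code, that approach might look like the following sketch (Lucene 4.6-era API; the classifier hook is hypothetical). The text is tokenized exactly once, the cached tokens feed the classifier, and the rewound cache is handed to a TokenStream field so IndexWriter replays it:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class ClassifyThenIndex {

  /** Tokenize once, classify from the cache, then build the full document. */
  public static Document buildDocument(Analyzer analyzer, String text) throws IOException {
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    CachingTokenFilter cached = new CachingTokenFilter(ts);

    // Pass 1: reset the underlying stream, then consume through the cache,
    // collecting terms for the classifier as a side effect.
    List<String> terms = new ArrayList<String>();
    CharTermAttribute termAtt = cached.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (cached.incrementToken()) {
      terms.add(termAtt.toString());
    }

    String docClass = classify(terms); // hypothetical classifier call

    // Pass 2: rewind the cache and wrap it in a TextField. IndexWriter will
    // replay the already-computed tokens instead of re-tokenizing. The
    // document is fully built before addDocument, so nothing mutates
    // mid-indexing. (Close the stream once the document has been indexed.)
    cached.reset();
    Document doc = new Document();
    doc.add(new TextField("body", cached));
    doc.add(new StringField("class", docClass, Field.Store.YES));
    return doc;
  }

  private static String classify(List<String> terms) {
    // Placeholder: a real classifier would score the term list.
    return terms.contains("lucene") ? "search" : "other";
  }
}
```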

Mike McCandless

http://blog.mikemccandless.com



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org