Posted to java-user@lucene.apache.org by yanyanzeng <lu...@hotmail.com> on 2008/08/07 02:39:25 UTC

bad index by batch indexing

Hi,
    I am building a search engine for text transcript documents stored in
the database of an enterprise messaging system. Because the production
database is huge (around 10 GB), I have designed a batch processing job that
builds the index incrementally.
    I am still testing in the DEV environment and have been puzzled by this
problem for a couple of days.
If I build the index in one sitting (possible because the DEV database is
very small), the index is correct: my queries return hits, and Luke shows
4800 documents and 450 terms.
However, if I build it with my batch processing job, I get an index that
looks fine, but every search returns 0 hits. Luke shows 5200 documents and
0 terms.
There is no exception, runtime error, or anything else abnormal during
indexing or searching. I am really at a loss.
The only difference between the two approaches is this: in the one-sitting
approach, the whole index is built using the same IndexWriter object; in the
batch approach, an IndexWriter object is opened per batch and closed when
the batch is finished.
But I think I have taken care of that with
               IndexWriter writer = new IndexWriter(FSDir, Analyser, !FSDir.exists());
 
Since Lucene adds to an existing index when the 3rd parameter is false, I do
not understand where it went wrong.
Should I have kept a single instance of the writer until all documents in
the database are processed, rather than opening and closing one for each
batch? Or should I have kept a single instance of the analyser?
That does not seem necessary, but I really cannot figure out what went
wrong, or what explains this strange behavior: 5200 documents but 0 terms.

I would be very grateful if anyone could advise. Thanks very much.

yanyan


-- 
View this message in context: http://www.nabble.com/bad-index-by-batch-indexing-tp18862037p18862037.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: bad index by batch indexing

Posted by yanyanzeng <lu...@hotmail.com>.
Hi, thank you very much for your reply. I am using the latest version,
Lucene 2.3.2. I will try the two-argument constructor and post my results
later.

yanyan




Anshum-2 wrote:
> 
> This really seems like an issue with the batching mechanism (one of those
> errors that seem trivial on discovery :) ). I work with batched indexing
> and it works absolutely fine on much larger data sets. You could try
> calling the IndexWriter constructor without the 3rd argument and see if it
> helps. Also, which version of Lucene are you using?
> 
> --
> Anshum
> http://ai-cafe.blogspot.com
> 

-- 
View this message in context: http://www.nabble.com/bad-index-by-batch-indexing-tp18862037p18863533.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: bad index by batch indexing

Posted by Anshum <an...@gmail.com>.
This really seems like an issue with the batching mechanism (one of those
errors that seem trivial on discovery :) ). I work with batched indexing and
it works absolutely fine on much larger data sets. You could try calling the
IndexWriter constructor without the 3rd argument and see if it helps.
Also, which version of Lucene are you using?
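
For illustration only (this is not code from the thread), here is a rough
sketch of what dropping the 3rd argument could look like against the Lucene
2.3.x API. The class name, the index path parameter, and the "body" field
name are assumptions made up for the example. The two-argument constructor
is documented to create the index if none exists yet and to append
otherwise, so the caller no longer computes a create flag per batch.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.util.List;

// Hypothetical helper class; assumes Lucene 2.3.x on the classpath.
public class BatchAppendSketch {

    // Indexes one batch of transcript texts into the index at indexPath.
    static void indexBatch(String indexPath, List<String> transcripts) throws IOException {
        Directory dir = FSDirectory.getDirectory(indexPath);
        Analyzer analyzer = new StandardAnalyzer();
        // No create flag: Lucene creates the index if it is missing and
        // appends to it if it already exists.
        IndexWriter writer = new IndexWriter(dir, analyzer);
        try {
            for (String text : transcripts) {
                Document doc = new Document();
                // "body" is an assumed field name; TOKENIZED makes the text searchable.
                doc.add(new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
        } finally {
            // Always close the writer so the batch is flushed and the lock released.
            writer.close();
        }
    }
}
```

Whether this changes the "0 terms" result depends on what the batch job
actually does, which the thread does not show.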

--
Anshum
http://ai-cafe.blogspot.com



-- 
--
The facts expressed here belong to everybody, the opinions to me.
The distinction is yours to draw............

Re: bad index by batch indexing

Posted by Mark Miller <ma...@gmail.com>.
Only one Writer should be active on an index at any given time. You
don't want to batch unless you need to see the docs in the index as
you build it; it's slower overall. You can add from multiple threads
at the same time, but use the same Writer.
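
As a rough sketch of the pattern described above (one writer opened once,
shared by every batch and thread, closed once at the end), the following
uses only the Java standard library. DocSink and CountingSink are
hypothetical stand-ins for IndexWriter so the example runs without Lucene;
they are not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedWriterSketch {

    /** Hypothetical stand-in for the two writer calls the pattern needs. */
    public interface DocSink {
        void addDocument(String text);
        void close();
    }

    /** Toy sink that only counts calls; a real job would wrap an IndexWriter. */
    public static final class CountingSink implements DocSink {
        public final AtomicInteger added = new AtomicInteger();
        public final AtomicInteger closed = new AtomicInteger();
        public void addDocument(String text) { added.incrementAndGet(); }
        public void close() { closed.incrementAndGet(); }
    }

    /** Sends every batch through the same sink from several threads, then closes it once. */
    public static void indexAll(DocSink sink, List<List<String>> batches, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (List<String> batch : batches) {
                tasks.add(() -> {
                    for (String text : batch) {
                        sink.addDocument(text); // all threads share the one sink
                    }
                    return null;
                });
            }
            // invokeAll blocks until every batch has finished.
            for (Future<Void> done : pool.invokeAll(tasks)) {
                done.get(); // rethrows a failure from any batch
            }
        } finally {
            pool.shutdown();
            sink.close(); // closed exactly once, after all batches
        }
    }

    public static void main(String[] args) throws Exception {
        CountingSink sink = new CountingSink();
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"), Arrays.asList("d", "e"));
        indexAll(sink, batches, 2);
        System.out.println("added=" + sink.added.get() + " closed=" + sink.closed.get());
    }
}
```

The batches still exist as units of work here; what changes is that none
of them opens or closes its own writer.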

Sent from my iPhone
