Posted to java-user@lucene.apache.org by Phil Whelan <ph...@gmail.com> on 2009/07/30 20:11:44 UTC

indexing multiple email addresses in one field

Hi,

We are developing a very large Lucene index that has a
field of email addresses. (Actually multiple fields with multiple
email addresses, but I'll simplify here.)

Each document will have one "email" field containing multiple email addresses.

I am indexing the email addresses using only WhitespaceAnalyzer, so as to
preserve the exact addresses and store multiple emails for one
document.

Example...
doc.add(new Field("email", "foo@bar.com bar@foo.com com@bar.foo",
Field.Store.YES, Field.Index.ANALYZED ));

Terms for this document will then be...
email:foo@bar.com
email:bar@foo.com
email:com@bar.foo

The problem I'm having is that these terms are rarely re-used in other
documents. There is little overlap in email usage, and there are a
lot of very long email addresses. Because of this, the number of
terms in my index is very big, and I think it is causing performance
issues and bloating the index.

I think I'm not using Lucene optimally here.


A couple of questions...

1) Is there a way I can analyze these emails down to smaller terms but
still search for the exact email address? For instance, if I used a
different analyzer and broke these down to the terms "foo", "bar", and
"com", is Lucene able to find "email:foo@bar.com" without matching
"email:com@foo.bar"?

2) Does Lucene retain the positional information of tokens in the
index? Knowing this will help me answer question 1.
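
To make question 1 concrete, here is a minimal sketch in plain Java of the kind of matching being asked about. It does not call Lucene at all: the class PositionalEmailSketch, its methods, the split on '@' and '.', and the position gap of 10 between addresses are all assumptions made up for illustration. It records a position for every small token and accepts an address only when its tokens occur at consecutive positions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Lucene.
class PositionalEmailSketch {

    // Maps each small token to the positions at which it occurs in one field value.
    static Map<String, List<Integer>> index(String fieldValue) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String address : fieldValue.trim().split("\\s+")) {
            for (String token : address.split("[@.]")) {
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
            }
            pos += 10; // assumed gap, so a match cannot span two addresses
        }
        return positions;
    }

    // True only if the address's tokens appear at consecutive positions, in order.
    static boolean matchesExactly(Map<String, List<Integer>> positions, String address) {
        String[] tokens = address.split("[@.]");
        for (int start : positions.getOrDefault(tokens[0], List.of())) {
            boolean ok = true;
            for (int i = 1; i < tokens.length && ok; i++) {
                ok = positions.getOrDefault(tokens[i], List.of()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> doc = index("foo@bar.com bar@foo.com com@bar.foo");
        System.out.println(matchesExactly(doc, "foo@bar.com")); // true
        System.out.println(matchesExactly(doc, "com@foo.bar")); // false
    }
}
```

The document contains only the three distinct terms "foo", "bar" and "com", yet the two addresses are still told apart, which is the behaviour question 1 hopes for. Whether Lucene can do the same depends on the answer to question 2.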

Thanks,
Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Phil Whelan <ph...@gmail.com>.
Hi Jibo,

Have you tried optimizing indexes? I do not know anything about the
implementation of ThreadedIndexWriter, but if they both optimize down
to the same size, it could just mean that ThreadedIndexWriter is not
as optimized.

Thanks,
Phil

On Fri, Jul 31, 2009 at 11:38 AM, Jibo John<ji...@mac.com> wrote:
> The number of docs is the same in the index in both cases (200,000).
> I haven't altered the benchmark/ code, but used a profiler to verify that
> the benchmark main thread is closed only after all other threads are closed.
>
> Thanks,
> -Jibo



Re: ThreadedIndexWriter vs. IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
Whoops, sorry for the confusion!

Mike

On Sat, Aug 1, 2009 at 1:03 PM, Phil Whelan<ph...@gmail.com> wrote:
> Hi Mike,
>
> It's Jibo, not me, having the problem. But thanks for the link. I was
> interested to look at the code. Will be buying the book soon.
>
> Phil
>
> On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> (Please note that ThreadedIndexWriter is source code available with
>> the upcoming revision to Lucene in Action.)
>>
>> Phil, is it possible you are using an older version of the book's
>> source code?  In particular, can you check whether your version of
>> ThreadedIndexWriter.java has this:
>>
>>  public void close(boolean doWait) throws CorruptIndexException, IOException {
>>    finish();
>>    super.close(doWait);
>>  }
>>
>> (I vaguely remember that being missing from earlier releases, which
>> could explain what you're seeing).  If you are missing that, can you
>> download the current code from http://www.manning.com/hatcher3 and try
>> again?
>>
>> If that's not the problem... can you post the benchmark alg you are
>> using in each case?
>>
>> Mike


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Phil Whelan <ph...@gmail.com>.
Hi Mike,

It's Jibo, not me, having the problem. But thanks for the link. I was
interested to look at the code. Will be buying the book soon.

Phil

On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
>
> (Please note that ThreadedIndexWriter is source code available with
> the upcoming revision to Lucene in Action.)
>
> Phil, is it possible you are using an older version of the book's
> source code?  In particular, can you check whether your version of
> ThreadedIndexWriter.java has this:
>
>  public void close(boolean doWait) throws CorruptIndexException, IOException {
>    finish();
>    super.close(doWait);
>  }
>
> (I vaguely remember that being missing from earlier releases, which
> could explain what you're seeing).  If you are missing that, can you
> download the current code from http://www.manning.com/hatcher3 and try
> again?
>
> If that's not the problem... can you post the benchmark alg you are
> using in each case?
>
> Mike
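
The close() override quoted above calls finish() before super.close(doWait). The thread does not show what finish() does, so the following is only a sketch under the assumption that it waits for queued documents to be processed. A plain ExecutorService stands in for the writer's thread pool, and every name here (DrainBeforeCloseSketch, add, written) is hypothetical rather than taken from the book's source.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a writer that indexes documents on a thread pool.
class DrainBeforeCloseSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger written = new AtomicInteger();
    private volatile boolean closed = false;

    void add(String doc) {
        pool.execute(() -> {
            // A job that runs after the resource is closed loses its document.
            if (!closed) {
                written.incrementAndGet();
            }
        });
    }

    // Assumed role of finish(): stop accepting work, then wait for queued jobs.
    void finish() {
        pool.shutdown();
        try {
            while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
                // keep waiting until every queued job has run
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void close() {
        finish();      // drain the pool first ...
        closed = true; // ... and only then close the underlying resource
    }

    int written() {
        return written.get();
    }

    public static void main(String[] args) {
        DrainBeforeCloseSketch writer = new DrainBeforeCloseSketch();
        for (int i = 0; i < 1000; i++) {
            writer.add("doc" + i);
        }
        writer.close();
        System.out.println(writer.written()); // 1000
    }
}
```

If close() set the closed flag without calling finish() first, jobs still sitting in the queue could run too late and their documents would be missing from the count.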



Re: ThreadedIndexWriter vs. IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
Phew!  Thank you for raising this... it was a sneaky one.

Mike

On Tue, Aug 11, 2009 at 4:13 PM, Jibo John<ji...@mac.com> wrote:
> Mike,
>
> Yes, it works perfectly!
>
> I did observe a dip in the indexing throughput (1855 recs/sec vs. 2200
> recs/sec previously), but, more importantly, no data is lost this time.
>
> Thanks for helping me nail this down.
>
> -Jibo
>
>
>
> On Aug 11, 2009, at 11:12 AM, Michael McCandless wrote:
>
>> OK I found the problem!
>>
>> It was losing docs from the queue, when shutting down the thread pool,
>> because we were calling super's addDocument(doc) not addDocument(doc,
>> analyzer).  IndexWriter was simply forwarding that call to
>> ThreadedIndexWriter's addDocument(doc, analyzer) which in turn would
>> do nothing because the thread pool was already told to shut down.
>> Larger queues made it much more likely to happen.
>>
>> Can you try the new version (attached)?
>>
>> Also, make sure you add 'doc.reuse.fields=false' to your alg (on
>> trunk).
>>
>> Mike
>>
>> On Tue, Aug 11, 2009 at 12:39 PM, Jibo John<ji...@mac.com> wrote:
>>>
>>> Mike,
>>>
>>> I wasn't exactly using the lucene core jar from MEAP.
>>>
>>> I have been building lucene from the source, and running the tests under
>>> lucene/java/trunk/contrib/benchmark/ (checked out 2 weeks ago, I guess)
>>>  and, also under  lucene/java/tags/lucene_2_4_1/contrib/benchmark/.
>>> In both cases, copied CreateThreadedIndexTask to
>>> org.apache.lucene.benchmark.byTask.tasks and ThreadedIndexWriter to
>>> org.apache.lucene.index.
>>>
>>> I have observed the issue in both the versions of lucene.
>>>
>>> Indexes were optimized separately using Lucli.
>>>
>>>
>>> PFA the classes and the alg.
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thank you for your help with this one.
>>>
>>> -Jibo
>>>
>>>
>>>
>>>
>>> On Aug 11, 2009, at 3:13 AM, Michael McCandless wrote:
>>>
>>>> I'm baffled why you're losing docs w/ ThreadedIndexWriter.
>>>>
>>>> One question: your Lucene core JAR seems to be newer than the last
>>>> MEAP update.  Did you update it manually?
>>>>
>>>> Also, your indexes were optimized, but your algs don't have an
>>>> optimize step -- did you separately run an optimize?
>>>>
>>>> Could you zip up the whole shebang (ThreadedIndexWriter.java,
>>>> CreateThreadedIndexTask.java, the algs) & post?  Please CC me directly
>>>> so I can grab the zip file... thanks.
>>>>
>>>> Mike
>>>>
>>>> On Mon, Aug 3, 2009 at 12:37 PM, Jibo John<ji...@mac.com> wrote:
>>>>>
>>>>> Mike,
>>>>>
>>>>> Verified that I have the latest source code.
>>>>> Here are the alg files and the checkindexer output.
>>>>>
>>>>>
>>>>> ----------------------------------------- indexwriter
>>>>> alg----------------------------------------------------------------
>>>>>
>>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>>>> directory=FSDirectory
>>>>>
>>>>> doc.stored = true                                                    #A
>>>>> docs.file=wikipedia.lines.txt
>>>>> ram.flush.mb=50
>>>>> compound=false
>>>>> merge.factor=5
>>>>> doc.add.log.step=1000
>>>>> doc.term.vector=false
>>>>> doc.term.vector.positions=false
>>>>> doc.term.vector.offsets=false
>>>>>
>>>>> { "Rounds"                                                           #B
>>>>>  ResetSystemErase
>>>>>  { "BuildIndex"
>>>>>  -CreateIndex()
>>>>>  [ { "AddDocs" AddDoc > : 40000 ] : 5
>>>>>  #C
>>>>>  -CloseIndex()
>>>>>  }
>>>>>  NewRound
>>>>> } : 1
>>>>>
>>>>> RepSumByPrefRound BuildIndex                                         #D
>>>>>
>>>>> -----------------------------------------threadedindexwriter alg
>>>>> ----------------------------------------------------------------
>>>>>
>>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>>>> directory=FSDirectory
>>>>>
>>>>> doc.stored = true                                                    #A
>>>>> docs.file=wikipedia.lines.txt
>>>>> ram.flush.mb=50
>>>>> compound=false
>>>>> merge.factor=5
>>>>> doc.add.log.step=1000
>>>>> doc.term.vector=false
>>>>> doc.term.vector.positions=false
>>>>> doc.term.vector.offsets=false
>>>>> writer.num.threads=15
>>>>> writer.max.thread.queue.size=75
>>>>> work.dir=work_t
>>>>>
>>>>>
>>>>> { "Rounds"                                                           #B
>>>>>  ResetSystemErase
>>>>>  { "BuildIndex"
>>>>>  -CreateThreadedIndex()
>>>>>  { "AddDocs" AddDoc > : 200000
>>>>>  -CloseIndex()
>>>>>  }
>>>>>  NewRound
>>>>> } : 1
>>>>>
>>>>> RepSumByPrefRound BuildIndex                                         #D
>>>>>
>>>>>
>>>>> -----------------------------------------------threadedindexwriter
>>>>> checkindex ----------------------------------------------------------
>>>>>
>>>>>
>>>>> $ java -classpath
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
>>>>> org.apache.lucene.index.CheckIndex
>>>>>
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>>>>>
>>>>> NOTE: testing will be more thorough if you run java with
>>>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>>>
>>>>> Opening index @
>>>>>
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>>>>>
>>>>> Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS
>>>>> [Lucene
>>>>> 2.9]
>>>>>  1 of 1: name=_p docCount=199941
>>>>>  compound=true
>>>>>  hasProx=true
>>>>>  numFiles=3
>>>>>  size (MB)=317.1
>>>>>  diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
>>>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
>>>>> mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7,
>>>>> source=merge, mergeFactor=5}
>>>>>  docStoreOffset=0
>>>>>  docStoreSegment=_0
>>>>>  docStoreIsCompoundFile=false
>>>>>  no deletions
>>>>>  test: open reader.........OK
>>>>>  test: fields, norms.......OK [4 fields]
>>>>>  test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs
>>>>> pairs;
>>>>> 133241176 tokens]
>>>>>  test: stored fields.......OK [199941 total field count; avg 1 fields
>>>>> per
>>>>> doc]
>>>>>  test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>> vector
>>>>> fields per doc]
>>>>>
>>>>> No problems were detected with this index.
>>>>>
>>>>> ------------------------------------------indexwriter checkindex
>>>>> ---------------------------------------------------------------
>>>>>
>>>>> $ java -classpath
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
>>>>> org.apache.lucene.index.CheckIndex
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>>>>>
>>>>> NOTE: testing will be more thorough if you run java with
>>>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>>>
>>>>> Opening index @
>>>>>
>>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>>>>>
>>>>> Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS
>>>>> [Lucene
>>>>> 2.9]
>>>>>  1 of 1: name=_18 docCount=200000
>>>>>  compound=true
>>>>>  hasProx=true
>>>>>  numFiles=1
>>>>>  size (MB)=427.445
>>>>>  diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
>>>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
>>>>> mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7,
>>>>> source=merge, mergeFactor=4}
>>>>>  no deletions
>>>>>  test: open reader.........OK
>>>>>  test: fields, norms.......OK [4 fields]
>>>>>  test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs
>>>>> pairs;
>>>>> 163219760 tokens]
>>>>>  test: stored fields.......OK [200000 total field count; avg 1 fields
>>>>> per
>>>>> doc]
>>>>>  test: term vectors........OK [0 total vector count; avg 0 term/freq
>>>>> vector
>>>>> fields per doc]
>>>>>
>>>>> No problems were detected with this index.
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -Jibo
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:
>>>>>
>>>>>> (Please note that ThreadedIndexWriter is source code available with
>>>>>> the upcoming revision to Lucene in Action.)
>>>>>>
>>>>>> Phil, is it possible you are using an older version of the book's
>>>>>> source code?  In particular, can you check whether your version of
>>>>>> ThreadedIndexWriter.java has this:
>>>>>>
>>>>>>  public void close(boolean doWait) throws CorruptIndexException,
>>>>>> IOException {
>>>>>>  finish();
>>>>>>  super.close(doWait);
>>>>>>  }
>>>>>>
>>>>>> (I vaguely remember that being missing from earlier releases, which
>>>>>> could explain what you're seeing).  If you are missing that, can you
>>>>>> download the current code from http://www.manning.com/hatcher3 and try
>>>>>> again?
>>>>>>
>>>>>> If that's not the problem... can you post the benchmark alg you are
>>>>>> using in each case?
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<ji...@mac.com> wrote:
>>>>>>>
>>>>>>> Hi Phil,
>>>>>>>
>>>>>>> It's 5 threads for IndexWriter.
>>>>>>>
>>>>>>> For ThreadedIndexWriter, I used:
>>>>>>>
>>>>>>> writer.num.threads=16
>>>>>>> writer.max.thread.queue.size=80
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Jibo
>>>>>>>
>>>>>>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>>>>>>>
>>>>>>>> Hi Jibo,
>>>>>>>>
>>>>>>>> Your mergeFactor is different, and the resulting numFiles (segment
>>>>>>>> files) is different. Maybe each thread is responsible for a segment
>>>>>>>> file. Just curious - do you have 3 threads?
>>>>>>>>
>>>>>>>> Phil


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Jibo John <ji...@mac.com>.
Mike,

Yes, it works perfectly!

I did observe a dip in the indexing throughput (1855 recs/sec vs. 2200
recs/sec previously), but, more importantly, no data is lost this time.

Thanks for helping me nail this down.

-Jibo






Re: ThreadedIndexWriter vs. IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
OK I found the problem!

It was losing docs from the queue, when shutting down the thread pool,
because we were calling super's addDocument(doc) not addDocument(doc,
analyzer).  IndexWriter was simply forwarding that call to
ThreadedIndexWriter's addDocument(doc, analyzer) which in turn would
do nothing because the thread pool was already told to shut down.
Larger queues made it much more likely to happen.
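
The forwarding problem described above can be reproduced in a few lines of plain Java. This is only a sketch: BaseWriter and QueueingWriter are hypothetical stand-ins for IndexWriter and ThreadedIndexWriter, the queue is a simple list drained on one thread, and the callTwoArgSuper switch exists purely to show both behaviours side by side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes; not the real IndexWriter or ThreadedIndexWriter code.
class ForwardingPitfallSketch {

    static class BaseWriter {
        final List<String> indexed = new ArrayList<>();

        // The one-argument overload forwards to the two-argument one.
        // The call is virtual, so a subclass override receives it.
        void addDocument(String doc) {
            addDocument(doc, "default-analyzer");
        }

        void addDocument(String doc, String analyzer) {
            indexed.add(doc);
        }
    }

    static class QueueingWriter extends BaseWriter {
        private final boolean callTwoArgSuper;
        private final List<String> queue = new ArrayList<>();
        private boolean shutDown = false;

        QueueingWriter(boolean callTwoArgSuper) {
            this.callTwoArgSuper = callTwoArgSuper;
        }

        @Override
        void addDocument(String doc, String analyzer) {
            if (shutDown) {
                return; // pool no longer accepts work: the document is dropped
            }
            queue.add(doc);
        }

        // Drains the queue after shutdown has been requested.
        void finish() {
            shutDown = true;
            for (String doc : queue) {
                if (callTwoArgSuper) {
                    super.addDocument(doc, "default-analyzer"); // reaches BaseWriter directly
                } else {
                    super.addDocument(doc); // forwards back into the override above
                }
            }
            queue.clear();
        }
    }

    public static void main(String[] args) {
        for (boolean fixed : new boolean[] {false, true}) {
            QueueingWriter writer = new QueueingWriter(fixed);
            writer.addDocument("a");
            writer.addDocument("b");
            writer.finish();
            System.out.println(writer.indexed.size()); // 0 for the buggy call, 2 for the fixed one
        }
    }
}
```

Because the sketch drains everything after shutdown, the buggy variant loses every document. In the real run only the documents still queued at shutdown were affected, which matches the small gap between the two docCount values.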

Can you try the new version (attached)?

Also, make sure you add 'doc.reuse.fields=false' to your alg (on
trunk).

Mike

On Tue, Aug 11, 2009 at 12:39 PM, Jibo John<ji...@mac.com> wrote:
> Mike,
>
> I wasn't exactly using the lucene core jar from MEAP.
>
> I have been building lucene from the source, and running the tests under
> lucene/java/trunk/contrib/benchmark/ (checked out 2 weeks ago, I guess)
> and, also under lucene/java/tags/lucene_2_4_1/contrib/benchmark/.
> In both cases, copied CreateThreadedIndexTask to
> org.apache.lucene.benchmark.byTask.tasks and ThreadedIndexWriter to
> org.apache.lucene.index.
>
> I have observed the issue in both the versions of lucene.
>
> Indexes were optimized separately using Lucli.
>
>
> PFA the classes and the alg.
>
>
>
>
>
>
> Thank you for your help with this one.
>
> -Jibo
>
>
>
>
> On Aug 11, 2009, at 3:13 AM, Michael McCandless wrote:
>
>> I'm baffled why you're losing docs w/ ThreadedIndexWriter.
>>
>> One question: your Lucene core JAR seems to be newer than the last
>> MEAP update.  Did you update it manually?
>>
>> Also, your indexes were optimized, but your algs don't have an
>> optimize step -- did you separately run an optimize?
>>
>> Could you zip up the whole shebang (ThreadedIndexWriter.java,
>> CreateThreadedIndexTask.java, the algs) & post?  Please CC me directly
>> so I can grab the zip file... thanks.
>>
>> Mike
>>
>> On Mon, Aug 3, 2009 at 12:37 PM, Jibo John<ji...@mac.com> wrote:
>>>
>>> Mike,
>>>
>>> Verified that I have the latest source code.
>>> Here are the alg files and the checkindexer output.
>>>
>>>
>>> ----------------------------------------- indexwriter
>>> alg----------------------------------------------------------------
>>>
>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>> directory=FSDirectory
>>>
>>> doc.stored = true                                                    #A
>>> docs.file=wikipedia.lines.txt
>>> ram.flush.mb=50
>>> compound=false
>>> merge.factor=5
>>> doc.add.log.step=1000
>>> doc.term.vector=false
>>> doc.term.vector.positions=false
>>> doc.term.vector.offsets=false
>>>
>>> { "Rounds"                                                           #B
>>>  ResetSystemErase
>>>  { "BuildIndex"
>>>  -CreateIndex()
>>>  [ { "AddDocs" AddDoc > : 40000 ] : 5
>>>  #C
>>>  -CloseIndex()
>>>  }
>>>  NewRound
>>> } : 1
>>>
>>> RepSumByPrefRound BuildIndex                                         #D
>>>
>>> -----------------------------------------threadedindexwriter alg
>>> ----------------------------------------------------------------
>>>
>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>> directory=FSDirectory
>>>
>>> doc.stored = true                                                    #A
>>> docs.file=wikipedia.lines.txt
>>> ram.flush.mb=50
>>> compound=false
>>> merge.factor=5
>>> doc.add.log.step=1000
>>> doc.term.vector=false
>>> doc.term.vector.positions=false
>>> doc.term.vector.offsets=false
>>> writer.num.threads=15
>>> writer.max.thread.queue.size=75
>>> work.dir=work_t
>>>
>>>
>>> { "Rounds"                                                           #B
>>>  ResetSystemErase
>>>  { "BuildIndex"
>>>  -CreateThreadedIndex()
>>>  { "AddDocs" AddDoc > : 200000
>>>  -CloseIndex()
>>>  }
>>>  NewRound
>>> } : 1
>>>
>>> RepSumByPrefRound BuildIndex                                         #D
>>>
>>>
>>> -----------------------------------------------threadedindexwriter
>>> checkindex ----------------------------------------------------------
>>>
>>>
>>> $ java -classpath
>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
>>> org.apache.lucene.index.CheckIndex
>>>
>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>>>
>>> NOTE: testing will be more thorough if you run java with
>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>
>>> Opening index @
>>>
>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>>>
>>> Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene
>>> 2.9]
>>> �1 of 1: name=_p docCount=199941
>>> �compound=true
>>> �hasProx=true
>>> �numFiles=3
>>> �size (MB)=317.1
>>> �diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
>>> mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7,
>>> source=merge, mergeFactor=5}
>>> �docStoreOffset=0
>>> �docStoreSegment=_0
>>> �docStoreIsCompoundFile=false
>>> �no deletions
>>> �test: open reader.........OK
>>> �test: fields, norms.......OK [4 fields]
>>> �test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs pairs;
>>> 133241176 tokens]
>>> �test: stored fields.......OK [199941 total field count; avg 1 fields per
>>> doc]
>>> �test: term vectors........OK [0 total vector count; avg 0 term/freq
>>> vector
>>> fields per doc]
>>>
>>> No problems were detected with this index.
>>>
>>> ------------------------------------------indexwriter checkindex
>>> ---------------------------------------------------------------
>>>
>>> $ java -classpath
>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
>>> org.apache.lucene.index.CheckIndex
>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>>>
>>> NOTE: testing will be more thorough if you run java with
>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>
>>> Opening index @
>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>>>
>>> Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene
>>> 2.9]
>>> �1 of 1: name=_18 docCount=200000
>>> �compound=true
>>> �hasProx=true
>>> �numFiles=1
>>> �size (MB)=427.445
>>> �diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
>>> mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7,
>>> source=merge, mergeFactor=4}
>>> �no deletions
>>> �test: open reader.........OK
>>> �test: fields, norms.......OK [4 fields]
>>> �test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs pairs;
>>> 163219760 tokens]
>>> �test: stored fields.......OK [200000 total field count; avg 1 fields per
>>> doc]
>>> �test: term vectors........OK [0 total vector count; avg 0 term/freq
>>> vector
>>> fields per doc]
>>>
>>> No problems were detected with this index.
>>>
>>>
>>> ---------------------------------------------------------------------------------------------------------
>>>
>>>
>>>
>>>
>>> Thanks,
>>> -Jibo
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:
>>>
>>>> (Please note that ThreadedIndexWriter is source code available with
>>>> the upcoming revision to Lucene in Action.)
>>>>
>>>> Phil, is it possible you are using an older version of the book's
>>>> source code? �In particular, can you check whether your version of
>>>> ThreadedIndexWriter.java has this:
>>>>
>>>> �public void close(boolean doWait) throws CorruptIndexException,
>>>> IOException {
>>>> �finish();
>>>> �super.close(doWait);
>>>> �}
>>>>
>>>> (I vaguely remember that being missing from earlier releases, which
>>>> could explain what you're seeing). �If you are missing that, can you
>>>> download the current code from http://www.manning.com/hatcher3 and try
>>>> again?
>>>>
>>>> If that's not the problem... can you post the benchmark alg you are
>>>> using in each case?
>>>>
>>>> Mike
>>>>
>>>> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<ji...@mac.com> wrote:
>>>>>
>>>>> Hi Phil,
>>>>>
>>>>> It's 5 threads for IndexWriter.
>>>>>
>>>>> For ThreadedIndexWriter, I used:
>>>>>
>>>>> writer.num.threads=16
>>>>> writer.max.thread.queue.size=80
>>>>>
>>>>> Thanks,
>>>>> -Jibo
>>>>>
>>>>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>>>>>
>>>>>> Hi Jibo,
>>>>>>
>>>>>> Your mergeFactor is different, and the resulting numFiles (segment
>>>>>> files) is different. Maybe each thread is responsible for a segment
>>>>>> file. Just curious - do you have 3 threads?
>>>>>>
>>>>>> Phil
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
>


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Jibo John <ji...@mac.com>.
Mike,

I wasn't exactly using the lucene core jar from MEAP.

I have been building Lucene from source and running the tests under
lucene/java/trunk/contrib/benchmark/ (checked out 2 weeks ago, I
guess) and also under
lucene/java/tags/lucene_2_4_1/contrib/benchmark/.
In both cases, I copied CreateThreadedIndexTask to
org.apache.lucene.benchmark.byTask.tasks and ThreadedIndexWriter to
org.apache.lucene.index.

I have observed the issue in both versions of Lucene.

Indexes were optimized separately using Lucli.


PFA the classes and the alg.


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
I'm baffled why you're losing docs w/ ThreadedIndexWriter.

One question: your Lucene core JAR seems to be newer than the last
MEAP update.  Did you update it manually?

Also, your indexes were optimized, but your algs don't have an
optimize step -- did you separately run an optimize?

Could you zip up the whole shebang (ThreadedIndexWriter.java,
CreateThreadedIndexTask.java, the algs) & post?  Please CC me directly
so I can grab the zip file... thanks.

Mike

On Mon, Aug 3, 2009 at 12:37 PM, Jibo John<ji...@mac.com> wrote:
> Mike,
>
> Verified that I have the latest source code.
> Here are the alg files and the checkindexer output.
>
>
> ----------------------------------------- indexwriter
> alg----------------------------------------------------------------
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
> directory=FSDirectory
>
> doc.stored = true                                                    #A
> docs.file=wikipedia.lines.txt
> ram.flush.mb=50
> compound=false
> merge.factor=5
> doc.add.log.step=1000
> doc.term.vector=false
> doc.term.vector.positions=false
> doc.term.vector.offsets=false
>
> { "Rounds"                                                           #B
>  ResetSystemErase
>  { "BuildIndex"
>  -CreateIndex()
>  [ { "AddDocs" AddDoc > : 40000 ] : 5                                    #C
>  -CloseIndex()
>  }
>  NewRound
> } : 1
>
> RepSumByPrefRound BuildIndex                                         #D
>
> -----------------------------------------threadedindexwriter alg
> ----------------------------------------------------------------
>
> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
> directory=FSDirectory
>
> doc.stored = true                                                    #A
> docs.file=wikipedia.lines.txt
> ram.flush.mb=50
> compound=false
> merge.factor=5
> doc.add.log.step=1000
> doc.term.vector=false
> doc.term.vector.positions=false
> doc.term.vector.offsets=false
> writer.num.threads=15
> writer.max.thread.queue.size=75
> work.dir=work_t
>
>
> { "Rounds"                                                           #B
>  ResetSystemErase
>  { "BuildIndex"
>  -CreateThreadedIndex()
>   { "AddDocs" AddDoc > : 200000
>  -CloseIndex()
>  }
>  NewRound
> } : 1
>
> RepSumByPrefRound BuildIndex                                         #D
>
>
> -----------------------------------------------threadedindexwriter
> checkindex ----------------------------------------------------------
>
>
> $ java -classpath
> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
> org.apache.lucene.index.CheckIndex
> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>
> NOTE: testing will be more thorough if you run java with
> '-ea:org.apache.lucene...', so assertions are enabled
>
> Opening index @
> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index
>
> Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene
> 2.9]
>  1 of 1: name=_p docCount=199941
>   compound=true
>   hasProx=true
>   numFiles=3
>   size (MB)=317.1
>   diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
> mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7,
> source=merge, mergeFactor=5}
>   docStoreOffset=0
>   docStoreSegment=_0
>   docStoreIsCompoundFile=false
>   no deletions
>   test: open reader.........OK
>   test: fields, norms.......OK [4 fields]
>   test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs pairs;
> 133241176 tokens]
>   test: stored fields.......OK [199941 total field count; avg 1 fields per
> doc]
>   test: term vectors........OK [0 total vector count; avg 0 term/freq vector
> fields per doc]
>
> No problems were detected with this index.
>
> ------------------------------------------indexwriter checkindex
> ---------------------------------------------------------------
>
> $ java -classpath
> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar
> org.apache.lucene.index.CheckIndex
> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>
> NOTE: testing will be more thorough if you run java with
> '-ea:org.apache.lucene...', so assertions are enabled
>
> Opening index @
> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index
>
> Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene
> 2.9]
>  1 of 1: name=_18 docCount=200000
>   compound=true
>   hasProx=true
>   numFiles=1
>   size (MB)=427.445
>   diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M -
> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
> mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7,
> source=merge, mergeFactor=4}
>   no deletions
>   test: open reader.........OK
>   test: fields, norms.......OK [4 fields]
>   test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs pairs;
> 163219760 tokens]
>   test: stored fields.......OK [200000 total field count; avg 1 fields per
> doc]
>   test: term vectors........OK [0 total vector count; avg 0 term/freq vector
> fields per doc]
>
> No problems were detected with this index.
>
> ---------------------------------------------------------------------------------------------------------
>
>
>
>
> Thanks,
> -Jibo
>
>
>
>
>
>
>
> On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:
>
>> (Please note that ThreadedIndexWriter is source code available with
>> the upcoming revision to Lucene in Action.)
>>
>> Phil, is it possible you are using an older version of the book's
>> source code?  In particular, can you check whether your version of
>> ThreadedIndexWriter.java has this:
>>
>>  public void close(boolean doWait) throws CorruptIndexException,
>> IOException {
>>   finish();
>>   super.close(doWait);
>>  }
>>
>> (I vaguely remember that being missing from earlier releases, which
>> could explain what you're seeing).  If you are missing that, can you
>> download the current code from http://www.manning.com/hatcher3 and try
>> again?
>>
>> If that's not the problem... can you post the benchmark alg you are
>> using in each case?
>>
>> Mike
>>
>> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<ji...@mac.com> wrote:
>>>
>>> Hi Phil,
>>>
>>> It's 5 threads for IndexWriter.
>>>
>>> For ThreadedIndexWriter, I used:
>>>
>>> writer.num.threads=16
>>> writer.max.thread.queue.size=80
>>>
>>> Thanks,
>>> -Jibo
>>>
>>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>>>
>>>> Hi Jibo,
>>>>
>>>> Your mergeFactor is different, and the resulting numFiles (segment
>>>> files) is different. Maybe each thread is responsible for a segment
>>>> file. Just curious - do you have 3 threads?
>>>>
>>>> Phil


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Jibo John <ji...@mac.com>.
Mike,

Verified that I have the latest source code.
Here are the alg files and the CheckIndex output.


-------------------- indexwriter alg --------------------

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
directory=FSDirectory

doc.stored = true                                                    #A
docs.file=wikipedia.lines.txt
ram.flush.mb=50
compound=false
merge.factor=5
doc.add.log.step=1000
doc.term.vector=false
doc.term.vector.positions=false
doc.term.vector.offsets=false

{ "Rounds"                                                           #B
  ResetSystemErase
  { "BuildIndex"
   -CreateIndex()
   [ { "AddDocs" AddDoc > : 40000 ] : 5                              #C
   -CloseIndex()
  }
  NewRound
} : 1

RepSumByPrefRound BuildIndex                                         #D

-------------------- threadedindexwriter alg --------------------

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
directory=FSDirectory

doc.stored = true                                                    #A
docs.file=wikipedia.lines.txt
ram.flush.mb=50
compound=false
merge.factor=5
doc.add.log.step=1000
doc.term.vector=false
doc.term.vector.positions=false
doc.term.vector.offsets=false
writer.num.threads=15
writer.max.thread.queue.size=75
work.dir=work_t


{ "Rounds"                                                           #B
  ResetSystemErase
  { "BuildIndex"
   -CreateThreadedIndex()
    { "AddDocs" AddDoc > : 200000
   -CloseIndex()
  }
  NewRound
} : 1

RepSumByPrefRound BuildIndex                                         #D


-------------------- threadedindexwriter checkindex --------------------


$ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work_t/index

Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_p docCount=199941
    compound=true
    hasProx=true
    numFiles=3
    size (MB)=317.1
    diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7, source=merge, mergeFactor=5}
    docStoreOffset=0
    docStoreSegment=_0
    docStoreIsCompoundFile=false
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [4 fields]
    test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs pairs; 133241176 tokens]
    test: stored fields.......OK [199941 total field count; avg 1 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.

-------------------- indexwriter checkindex --------------------

$ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_18 docCount=200000
    compound=true
    hasProx=true
    numFiles=1
    size (MB)=427.445
    diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7, source=merge, mergeFactor=4}
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [4 fields]
    test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs pairs; 163219760 tokens]
    test: stored fields.......OK [200000 total field count; avg 1 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.

---------------------------------------------------------------------------------------------------------
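
Putting the two CheckIndex reports side by side, the differences can be worked out directly from the numbers above. The sketch below is only arithmetic on the reported values (the class and method names are made up for illustration); it does not open or read an index.

```java
import java.util.Locale;

// Illustrative only: compares figures copied by hand from the two CheckIndex reports.
public class CheckIndexDiff {
    // Percentage by which `smaller` falls short of `larger`.
    public static double percentLess(double larger, double smaller) {
        return (larger - smaller) / larger * 100.0;
    }

    public static void main(String[] args) {
        // IndexWriter run vs. ThreadedIndexWriter run, as reported above.
        int docsPlain = 200000, docsThreaded = 199941;
        double mbPlain = 427.445, mbThreaded = 317.1;
        long termsPlain = 3512343L, termsThreaded = 1269552L;

        System.out.println("missing docs: " + (docsPlain - docsThreaded));
        System.out.println(String.format(Locale.ROOT,
                "size reduction: %.1f%%", percentLess(mbPlain, mbThreaded)));
        System.out.println(String.format(Locale.ROOT,
                "unique terms reduction: %.1f%%", percentLess(termsPlain, termsThreaded)));
    }
}
```

On these figures the threaded index is missing 59 documents, is roughly 26% smaller on disk, and has roughly 64% fewer unique terms. The missing documents alone seem unlikely to account for a gap of that size, so the term counts are probably the more useful number to chase.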




Thanks,
-Jibo







On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:

> (Please note that ThreadedIndexWriter is source code available with
> the upcoming revision to Lucene in Action.)
>
> Phil, is it possible you are using an older version of the book's
> source code?  In particular, can you check whether your version of
> ThreadedIndexWriter.java has this:
>
>  public void close(boolean doWait) throws CorruptIndexException,  
> IOException {
>    finish();
>    super.close(doWait);
>  }
>
> (I vaguely remember that being missing from earlier releases, which
> could explain what you're seeing).  If you are missing that, can you
> download the current code from http://www.manning.com/hatcher3 and try
> again?
>
> If that's not the problem... can you post the benchmark alg you are
> using in each case?
>
> Mike
>
> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<ji...@mac.com> wrote:
>> Hi Phil,
>>
>> It's 5 threads for IndexWriter.
>>
>> For ThreadedIndexWriter, I used:
>>
>> writer.num.threads=16
>> writer.max.thread.queue.size=80
>>
>> Thanks,
>> -Jibo
>>
>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>>
>>> Hi Jibo,
>>>
>>> Your mergeFactor is different, and the resulting numFiles (segment
>>> files) is different. Maybe each thread is responsible for a segment
>>> file. Just curious - do you have 3 threads?
>>>
>>> Phil


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
(Please note that ThreadedIndexWriter is source code available with
the upcoming revision to Lucene in Action.)

Phil, is it possible you are using an older version of the book's
source code?  In particular, can you check whether your version of
ThreadedIndexWriter.java has this:

  public void close(boolean doWait) throws CorruptIndexException, IOException {
    finish();
    super.close(doWait);
  }

(I vaguely remember that being missing from earlier releases, which
could explain what you're seeing).  If you are missing that, can you
download the current code from http://www.manning.com/hatcher3 and try
again?
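
For readers without the book's source at hand, the idea behind that close() override can be sketched with plain java.util.concurrent. This is not the ThreadedIndexWriter code from Lucene in Action: QueueingWriterSketch, addDocument() and the in-memory list are hypothetical stand-ins. The only point is that close() has to drain the thread pool before the underlying writer is closed, otherwise documents still sitting in the queue can be silently dropped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a writer that hands documents to a thread pool.
public class QueueingWriterSketch {
    private final ExecutorService pool;
    // Stands in for the real index; synchronized so worker threads can append safely.
    private final List<String> index =
            Collections.synchronizedList(new ArrayList<String>());

    public QueueingWriterSketch(int numThreads) {
        pool = Executors.newFixedThreadPool(numThreads);
    }

    // Returns immediately; the document is added later by a worker thread.
    public void addDocument(final String doc) {
        pool.execute(new Runnable() {
            public void run() {
                index.add(doc);
            }
        });
    }

    // Stop accepting new work, then block until every queued document is processed.
    private void finish() throws InterruptedException {
        pool.shutdown();
        while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
            // still draining the queue; keep waiting
        }
    }

    // Same shape as the override above: finish() first, then close.
    // Returns the number of documents that actually made it into the "index".
    public int close() throws InterruptedException {
        finish();
        return index.size();
    }

    // Convenience driver: queue numDocs documents, close, report what was kept.
    public static int indexAndClose(int numDocs, int numThreads)
            throws InterruptedException {
        QueueingWriterSketch writer = new QueueingWriterSketch(numThreads);
        for (int i = 0; i < numDocs; i++) {
            writer.addDocument("doc" + i);
        }
        return writer.close();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(indexAndClose(2000, 16));
    }
}
```

With finish() in place, every queued document is counted. If close() skipped finish(), the count would depend on how far the worker threads had got, which is the kind of shortfall in docCount to look for.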

If that's not the problem... can you post the benchmark alg you are
using in each case?

Mike

On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<ji...@mac.com> wrote:
> Hi Phil,
>
> It's 5 threads for IndexWriter.
>
> For ThreadedIndexWriter, I used:
>
> writer.num.threads=16
> writer.max.thread.queue.size=80
>
> Thanks,
> -Jibo
>
> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>
>> Hi Jibo,
>>
>> Your mergeFactor is different, and the resulting numFiles (segment
>> files) is different. Maybe each thread is responsible for a segment
>> file. Just curious - do you have 3 threads?
>>
>> Phil


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Jibo John <ji...@mac.com>.
Hi Phil,

It's 5 threads for IndexWriter.

For ThreadedIndexWriter, I used:

writer.num.threads=16
writer.max.thread.queue.size=80

Thanks,
-Jibo

On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:

> Hi Jibo,
>
> Your mergeFactor is different, and the resulting numFiles (segment
> files) is different. Maybe each thread is responsible for a segment
> file. Just curious - do you have 3 threads?
>
> Phil


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Phil Whelan <ph...@gmail.com>.
Hi Jibo,

Your mergeFactor is different, and the resulting numFiles (segment
files) is different. Maybe each thread is responsible for a segment
file. Just curious - do you have 3 threads?

Phil



Re: ThreadedIndexWriter vs. IndexWriter

Posted by Jibo John <ji...@mac.com>.
Mike,

Here you go:


IndexWriter:
----------------
$ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_18 docCount=200000
    compound=true
    hasProx=true
    numFiles=1
    size (MB)=427.448
    diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7, source=merge, mergeFactor=4}
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [4 fields]
    test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs pairs; 163219760 tokens]
    test: stored fields.......OK [200000 total field count; avg 1 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.


ThreadedIndexWriter:
-----------------------------

$ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_q docCount=199970
    compound=true
    hasProx=true
    numFiles=3
    size (MB)=319.107
    diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7, source=merge, mergeFactor=6}
    docStoreOffset=0
    docStoreSegment=_0
    docStoreIsCompoundFile=false
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [4 fields]
    test: terms, freq, prox...OK [1227086 terms; 69244121 terms/docs pairs; 134390948 tokens]
    test: stored fields.......OK [199970 total field count; avg 1 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.


$



On Jul 31, 2009, at 2:52 PM, Michael McCandless wrote:

> Hmmm... can you run CheckIndex on both indexes and post the results?
>
>  java org.apache.lucene.index.CheckIndex /path/to/index
>
> Mike
>
> On Fri, Jul 31, 2009 at 2:38 PM, Jibo John<ji...@mac.com> wrote:
>> The number of docs is the same in the index for both cases (200,000).
>> I haven't altered the benchmark/ code, but used a profiler to verify that
>> the Benchmark main thread is closed only after all other threads are closed.
>>
>> Thanks,
>> -Jibo
>>
>>
>> On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:
>>
>>> Hmm... this doesn't sound right.
>>>
>>> That example (ThreadedIndexWriter) is meant to be a drop-in
>>> replacement, wherever you use an IndexWriter, that keeps an
>>> under-the-hood thread pool (using java.util.concurrent.*) to
>>> add/update documents with multiple threads.
>>>
>>> It should not result in a smaller index.
>>>
>>> Can you sanity check the index?  Eg is numDocs() the same for both?
>>> You definitely called close() on the writer, right?  That method waits
>>> for all threads to finish their work before actually closing.
>>>
>>> Mike
>>>
>>> On Thu, Jul 30, 2009 at 8:01 PM, Jibo John<ji...@mac.com> wrote:
>>>>
>>>> While trying out a few tuning options using contrib/benchmark as
>>>> described in the LIA (2nd edition) book, I had an interesting observation.
>>>>
>>>> If I use a ThreadedIndexWriter (picked the example from lia2e, page 356)
>>>> instead of IndexWriter, the index size got reduced by 40% compared to
>>>> using IndexWriter.
>>>> The index-related configuration was the same for both tests in the alg
>>>> file.
>>>>
>>>> I am curious how using a threaded index writer can have an impact
>>>> on the index size.
>>>>
>>>> Appreciate your input.
>>>>
>>>> Thanks,
>>>> -Jibo


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmmm... can you run CheckIndex on both indexes and post the results?

  java org.apache.lucene.index.CheckIndex /path/to/index

Mike

On Fri, Jul 31, 2009 at 2:38 PM, Jibo John<ji...@mac.com> wrote:
> The number of docs is the same in the index for both cases (200,000).
> I haven't altered the benchmark/ code, but used a profiler to verify that
> the Benchmark main thread is closed only after all other threads are closed.
>
> Thanks,
> -Jibo
>
>
> On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:
>
>> Hmm... this doesn't sound right.
>>
>> That example (ThreadedIndexWriter) is meant to be a drop-in
>> replacement, wherever you use an IndexWriter, that keeps an
>> under-the-hood thread pool (using java.util.concurrent.*) to
>> add/update documents with multiple threads.
>>
>> It should not result in a smaller index.
>>
>> Can you sanity check the index?  Eg is numDocs() the same for both?
>> You definitely called close() on the writer, right?  That method waits
>> for all threads to finish their work before actually closing.
>>
>> Mike
>>
>> On Thu, Jul 30, 2009 at 8:01 PM, Jibo John<ji...@mac.com> wrote:
>>>
>>> While trying out a few tuning options using contrib/benchmak as described
>>> in
>>> LIA (2nd edition) book, I had an interesting observation.
>>>
>>> If I use a ThreadedIndexWriter (picked the example from lia2e, page 356)
>>> instead of IndexWriter, the index size got reduced by 40% compared to
>>> using
>>> IndexWriter.
>>> Index related configuration were the same for both the tests in the alg
>>> file.
>>>
>>> I am curious how come using a threaded index writer will have an impact
>>> on
>>> the index size.
>>>
>>> Appreciate your input.
>>>
>>> Thanks,
>>> -Jibo
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Jibo John <ji...@mac.com>.
The number of docs is the same in both indexes (200,000).
I haven't altered the benchmark/ code, but I used a profiler to verify
that the Benchmark main thread is closed only after all other threads
are closed.

Thanks,
-Jibo




Re: ThreadedIndexWriter vs. IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm... this doesn't sound right.

That example (ThreadedIndexWriter) is meant to be a drop-in
replacement, wherever you use an IndexWriter, that keeps an
under-the-hood thread pool (using java.util.concurrent.*) to
add/update documents with multiple threads.

It should not result in a smaller index.

Can you sanity-check the index?  E.g., is numDocs() the same for both?
You definitely called close() on the writer, right?  That method waits
for all threads to finish their work before actually closing.
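
To illustrate why close() matters, here is a minimal sketch in plain Java.
It is not the lia2e ThreadedIndexWriter code and uses no Lucene API: the
names PooledWriterSketch, DocSink, CountingSink and PooledWriter are
hypothetical stand-ins. Adds are queued on a thread pool, and close() waits
for every queued add before closing the underlying sink, so the document
count should come out the same as with a single thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PooledWriterSketch {

    /** Hypothetical stand-in for the wrapped writer; not a Lucene type. */
    interface DocSink {
        void add(String doc);
        void close();
    }

    /** Counts documents so the result can be checked after close(). */
    static class CountingSink implements DocSink {
        final AtomicInteger added = new AtomicInteger();
        public void add(String doc) { added.incrementAndGet(); }
        public void close() { /* nothing to release in this sketch */ }
    }

    /** Queues each add on a fixed-size pool; close() drains the pool first. */
    static class PooledWriter {
        private final DocSink sink;
        private final ExecutorService pool;

        PooledWriter(DocSink sink, int threads) {
            this.sink = sink;
            this.pool = Executors.newFixedThreadPool(threads);
        }

        void add(String doc) {
            pool.execute(() -> sink.add(doc));
        }

        void close() {
            pool.shutdown(); // accept no new jobs
            try {
                while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
                    // keep waiting until every queued add has run
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while draining adds", e);
            }
            sink.close(); // only now close the underlying sink
        }
    }

    /** Adds n documents through the pool and returns how many reached the sink. */
    public static int indexAll(int n, int threads) {
        CountingSink sink = new CountingSink();
        PooledWriter writer = new PooledWriter(sink, threads);
        for (int i = 0; i < n; i++) {
            writer.add("doc-" + i);
        }
        writer.close();
        return sink.added.get();
    }

    public static void main(String[] args) {
        System.out.println(indexAll(1000, 4));
    }
}
```

If close() returned without waiting for the pool, the count read afterwards
could be lower than the number of documents submitted.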

Mike



ThreadedIndexWriter vs. IndexWriter

Posted by Jibo John <ji...@mac.com>.
While trying out a few tuning options using contrib/benchmark, as
described in the LIA (2nd edition) book, I made an interesting observation.

If I use a ThreadedIndexWriter (picked the example from lia2e, page
356) instead of IndexWriter, the index size is reduced by 40%
compared to using IndexWriter.
The index-related configuration was the same for both tests in the
alg file.

I am curious how using a threaded index writer could have an
impact on the index size.

Appreciate your input.

Thanks,
-Jibo



Re: indexing multiple email addresses in one field

Posted by Paul Cowan <co...@aconex.com>.
Phil Whelan wrote:
> It seems I have to use the same Analyzer for all the fields in the
> index?

Nope. Look at PerFieldAnalyzerWrapper, which is effectively a Map of 
field names -> analyzers. This might help if different fields will have 
very different values and semantics.
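
As a rough model of the "Map of field names -> analyzers" idea, here is a
sketch in plain Java. It deliberately does not use Lucene classes: the
class PerFieldSketch and its methods are hypothetical, and an "analyzer"
is reduced to a function from text to a list of tokens. The real class to
look at in Lucene is PerFieldAnalyzerWrapper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    public PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers a field-specific analyzer, overriding the default for that field. */
    public void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    /** Looks up the analyzer by field name, falling back to the default. */
    public List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Splits on whitespace only, keeping each value intact. */
    public static List<String> whitespace(String text) {
        return nonEmpty(text.trim().split("\\s+"));
    }

    /** Lower-cases, then splits on every character that is not a letter or digit. */
    public static List<String> alphanumeric(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
    }

    private static List<String> nonEmpty(String[] parts) {
        List<String> out = new ArrayList<>();
        for (String p : parts) {
            if (!p.isEmpty()) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = new PerFieldSketch(PerFieldSketch::whitespace);
        wrapper.addAnalyzer("email", PerFieldSketch::alphanumeric);
        System.out.println(wrapper.analyze("email", "foo@bar.com"));    // [foo, bar, com]
        System.out.println(wrapper.analyze("ip", "10.0.0.1 10.0.0.2")); // [10.0.0.1, 10.0.0.2]
    }
}
```

The same wrapper would need to be used at both index time and query time,
so that each field is tokenized consistently.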

Cheers,

Paul



Re: indexing multiple email addresses in one field

Posted by Phil Whelan <ph...@gmail.com>.
Thanks Matt. Thanks Paul. I'm up early (PST) and ready for a major
rewrite of my indexer. I think these changes are going to make a huge
difference.

Cheers,
Phil




-- 
Mobile: +1  778-233-4935
Website: http://philw.co.uk
Skype: philwhelan76
Twitter: philwhln
Email : phil123@gmail.com
iChat: philwhln@mac.com



Re: indexing multiple email addresses in one field

Posted by Matthew Hall <mh...@informatics.jax.org>.
And to address the stop word issue, you can override the stop word list
that it uses.

Most analyzers that use stop words (Standard included) have an option to
pass in an arbitrary list of stop words, which will override the defaults.

You could also just roll your own (which is what you are going to end up
doing here anyhow). When you do, just don't include stop word removal in
the processing of your token stream.
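
To make the effect concrete, here is a small sketch in plain Java (no
Lucene classes). StopWordSketch, its methods and SAMPLE_STOP_WORDS are
hypothetical; the sample list is illustrative only and is not Lucene's
default stop set. Passing an empty set models "overriding the defaults"
so that nothing is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {

    /** Illustrative sample list only; not Lucene's actual default stop words. */
    public static final Set<String> SAMPLE_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "the", "to"));

    /** Lower-cases, splits on non-alphanumerics, and drops tokens found in stopWords. */
    public static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Same tokenization with an empty stop list, so every token is kept. */
    public static List<String> tokenizeKeepAll(String text) {
        return tokenize(text, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        // An address built from common words loses every token with the sample list...
        System.out.println(tokenize("the@and.to", SAMPLE_STOP_WORDS)); // []
        // ...but stays searchable once the stop list is empty.
        System.out.println(tokenizeKeepAll("the@and.to"));             // [the, and, to]
    }
}
```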

Matt



-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012





Re: indexing multiple email addresses in one field

Posted by Phil Whelan <ph...@gmail.com>.
Hi Matthew / Paul,

On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan<co...@aconex.com> wrote:
> Matthew Hall wrote:
>>
>> Place a delimiter between the email addresses that doesn't get removed in
>> your analyzer.  (preferably something you know will never be searched on)
>
> Or add them separately (rather than:
>  doc.add(new Field("email", "foo@bar.com bar@foo.com com@bar.foo" ...);
> use
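>  doc.add(new Field("email", "foo@bar.com", Field.Store.YES, Field.Index.ANALYZED));
>  doc.add(new Field("email", "bar@foo.com", Field.Store.YES, Field.Index.ANALYZED));
>  doc.add(new Field("email", "com@bar.foo", Field.Store.YES, Field.Index.ANALYZED));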
> ), using an Analyzer that overrides getPositionIncrementGap(). This inserts
> a 'gap' between each set of Tokens for the same Field, which stops phrase
> queries from 'crossing the boundaries' between subsequent values.

I like the sound of that! I think I understand it.
getPositionIncrementGap() returns 0 by default, which keeps the "email"
field tokens sequential. Overriding it to return 1 adds an effective
blank token between the email addresses (returning 2 would leave
2). Similar to Matthew's delimiter token, but a bit neater.

So the tokens (with positions in brackets) would look something like this:

"foo(0) bar(1) com(2) bar(4) foo(5) com(6) com(8) bar(9) foo(10)"
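
The position arithmetic above can be reproduced with a small model in
plain Java. This is only a sketch of my understanding, not Lucene code:
PositionGapSketch and positions() are hypothetical names, and the
tokenizer simply splits on non-alphanumeric characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PositionGapSketch {

    /**
     * Assigns a position to every token of every value. Before each value
     * after the first, the position counter is advanced by `gap`.
     */
    public static String positions(List<String> values, int gap) {
        StringBuilder sb = new StringBuilder();
        int next = 0;
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                next += gap; // leave `gap` unused positions between values
            }
            for (String tok : values.get(i).toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (tok.isEmpty()) {
                    continue;
                }
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(tok).append('(').append(next++).append(')');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> emails = Arrays.asList("foo@bar.com", "bar@foo.com", "com@bar.foo");
        System.out.println(positions(emails, 0));
        System.out.println(positions(emails, 1));
    }
}
```

With a gap of 0 the positions run 0 through 8 without a break; with a gap
of 1 the output is the token line shown above.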

Up until now I've only been using the WhitespaceAnalyzer, as I've been
keeping quite tight control over the fields going into the index
(not making best use of Lucene).

What Analyzer would you recommend I use for this? I'll also be
indexing IPs, and other things, but that's pretty much the same story.
It seems I have to use the same Analyzer for all the fields in the
index?

I've been looking at StandardAnalyzer, but I do not want to remove
stop words. I want to keep mainly letters and numbers, and also
override getPositionIncrementGap. Is there anything that does these
things already, or close to it? Overriding getPositionIncrementGap
shouldn't be difficult though.

Cheers,
Phil



Re: indexing multiple email addresses in one field

Posted by Paul Cowan <co...@aconex.com>.
Matthew Hall wrote:
> Place a delimiter between the email addresses that doesn't get removed 
> in your analyzer.  (preferably something you know will never be searched 
> on)

Or add them separately (rather than:
   doc.add(new Field("email", "foo@bar.com bar@foo.com com@bar.foo" ...);
use
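   doc.add(new Field("email", "foo@bar.com", Field.Store.YES, Field.Index.ANALYZED));
   doc.add(new Field("email", "bar@foo.com", Field.Store.YES, Field.Index.ANALYZED));
   doc.add(new Field("email", "com@bar.foo", Field.Store.YES, Field.Index.ANALYZED));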
), using an Analyzer that overrides getPositionIncrementGap(). This 
inserts a 'gap' between each set of Tokens for the same Field, which 
stops phrase queries from 'crossing the boundaries' between subsequent 
values.
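
Here is a rough model, in plain Java rather than Lucene, of why the gap
stops an exact phrase from matching across two values. GapPhraseSketch
and its methods are hypothetical names; the postings map (term to set of
positions) is a simplification, and matching is exact-adjacency only
(no slop).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class GapPhraseSketch {

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Builds term -> positions, skipping `gap` positions between values. */
    static Map<String, Set<Integer>> index(List<String> values, int gap) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        int next = 0;
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                next += gap;
            }
            for (String tok : tokens(values.get(i))) {
                postings.computeIfAbsent(tok, k -> new HashSet<>()).add(next++);
            }
        }
        return postings;
    }

    /** True if the tokens of `address` occur at consecutive positions. */
    public static boolean matches(List<String> values, int gap, String address) {
        List<String> phrase = tokens(address);
        if (phrase.isEmpty()) {
            return false;
        }
        Map<String, Set<Integer>> postings = index(values, gap);
        Set<Integer> none = Collections.<Integer>emptySet();
        for (int start : postings.getOrDefault(phrase.get(0), none)) {
            boolean ok = true;
            for (int i = 1; i < phrase.size() && ok; i++) {
                ok = postings.getOrDefault(phrase.get(i), none).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> emails = Arrays.asList("foo@bar.com", "bar@foo.com", "com@bar.foo");
        System.out.println(matches(emails, 0, "foo@com.com")); // true: crosses a value boundary
        System.out.println(matches(emails, 1, "foo@com.com")); // false: the gap breaks adjacency
        System.out.println(matches(emails, 1, "bar@foo.com")); // true: a real address still matches
    }
}
```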

Cheers,

Paul



Re: indexing multiple email addresses in one field

Posted by Matthew Hall <mh...@informatics.jax.org>.
Place a delimiter between the email addresses that doesn't get removed 
in your analyzer.  (preferably something you know will never be searched on)

That way you can ensure that each email matches independently of the others.

So something like

foo@bar.com DELIM123 bar@foo.com DELIM123 com@bar.foo

Matt




-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012





Re: indexing multiple email addresses in one field

Posted by Phil Whelan <ph...@gmail.com>.
On Thu, Jul 30, 2009 at 11:22 AM, Matthew Hall
<mh...@informatics.jax.org> wrote:
>
> 1. Sure, just have an analyzer that splits on all non letter characters.
> 2. Phrase queries keep the order intact.  (And yes, the positional information for the terms is kept, which is what allows span queries to work)
>
> So searching on the following "foo bar com" will match foo@bar.com but not bar@foo.com

Thanks, I really appreciate your help with this. That's great to know.
Can I take this a little further...

If I have "foo@bar.com bar@foo.com com@bar.foo" and analyze it, I get
"foo bar com bar foo com com bar foo", so perhaps I need a different
way of delimiting the emails, as a phrase query will also match some
other combinations here, e.g. foo@com.com, which is not one of the emails.

Has anyone done anything similar? I can imagine that one option would
be to filter the returned docs based on the original content of the
string I'm analyzing. Does Lucene do this for me?

Thanks,
Phil



Re: indexing multiple email addresses in one field

Posted by Matthew Hall <mh...@informatics.jax.org>.
1. Sure, just have an analyzer that splits on all non-letter characters.
2. Phrase queries keep the order intact.  (And yes, the positional 
information for the terms is kept, which is what allows span queries to 
work)

So searching on the following "foo bar com" will match foo@bar.com but 
not bar@foo.com
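
A toy illustration of that ordering behaviour, in plain Java and without
any Lucene classes (PhraseOrderSketch and its methods are hypothetical
names): the text is split on non-letter characters, and a phrase matches
only if its tokens appear as a contiguous run in the same order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseOrderSketch {

    /** Lower-cases and splits on every non-letter character. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** True if the phrase tokens occur contiguously, in order, in the indexed text. */
    public static boolean phraseMatches(String indexedText, String phrase) {
        List<String> wanted = tokens(phrase);
        return !wanted.isEmpty()
                && Collections.indexOfSubList(tokens(indexedText), wanted) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(phraseMatches("foo@bar.com", "foo bar com")); // true
        System.out.println(phraseMatches("bar@foo.com", "foo bar com")); // false
    }
}
```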

Matt

Phil Whelan wrote:
> Hi,
>
> We have a very large Lucene index that we're developing that has a
> field of email addresses. (Actually multiple fields with multiple
> email addresses, but I'll simplify here.)
>
> Each document will have one "email" field containing multiple email addresses.
>
> I am indexing email addresses using only WhitespaceAnalyzer, so as to
> preserve the exact addresses and store multiple emails for one
> document.
>
> Example...
> doc.add(new Field("email", "foo@bar.com bar@foo.com com@bar.foo",
> Field.Store.YES, Field.Index.ANALYZED ));
>
> Terms for this document will then be...
> email:foo@bar.com
> email:bar@foo.com
> email:com@bar.foo
>
> The problem I'm having is that these terms are rarely re-used in other
> documents. There is little overlap in email usage, and there are a
> lot of very long email addresses. Because of this, the number of
> terms in my index is very large, and I think it is causing performance
> issues and bloating the index.
>
> I think I'm not using Lucene optimally here.
>
>
> A couple of questions...
>
> 1) Is there a way I can analyze these emails down to smaller terms but
> still search for the exact email address? For instance, if I used a
> different analyzer and broke these down to the terms "foo", "bar", and
> "com", is Lucene able to find "email:foo@bar.com" without matching
> "email:com@foo.bar"?
>
> 2) Does Lucene retain the positional information of tokens in the
> index? Knowing this will help me answer question 1.
>
> Thanks,
> Phil


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012


