Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/08/25 14:31:04 UTC

2.3

Do we want to start thinking about a 2.3 release in the next month or
so, say by the end of September?  I propose this time around, during  
our freeze cycle, we take a good look at cleaning up JIRA, especially  
some of the older issues (< LUCENE-400).  I am hoping to add a Hudson  
call similar to what Hadoop does that automatically applies patches  
to the trunk to see if the tests pass, codestyle is met, etc.  If  
anyone wants to help out with that, I would appreciate it.  Ideally,  
we could build something that is nice and portable so that other  
Lucene projects can use it (maybe even other ASF projects).  It seems
like it could be done through some ANT tasks and targets, instead of
the current shell way.  Mostly, I say that, though, b/c I am not much
of a shell scripter and it is not a knock against the Hadoop
implementation.

We also should do some thinking on what we want to deprecate per our  
plan to move to JDK 1.5 and Lucene 3.0.

Thoughts?

-Grant





Re: 2.3

Posted by Nigel Daley <nd...@yahoo-inc.com>.
Hi Grant,

The Hadoop-Patch process script is now re-written and should be  
easier to clone.

http://svn.apache.org/viewvc/lucene/hadoop/nightly/hudsonBuildHadoopPatch.sh?view=markup

and

http://svn.apache.org/viewvc/lucene/hadoop/nightly/processHadoopPatchEmail.sh?revision=575463&view=markup

Cheers,
Nige

On Aug 25, 2007, at 5:31 AM, Grant Ingersoll wrote:

> Do we want to start thinking about a 2.3 release in the next month or
> so, say by the end of September?  I propose this time around,  
> during our freeze cycle, we take a good look at cleaning up JIRA,  
> especially some of the older issues (< LUCENE-400).  I am hoping to  
> add a Hudson call similar to what Hadoop does that automatically  
> applies patches to the trunk to see if the tests pass, codestyle is  
> met, etc.  If anyone wants to help out with that, I would  
> appreciate it.  Ideally, we could build something that is nice and  
> portable so that other Lucene projects can use it (maybe even other  
> ASF projects).  It seems like it could be done through some ANT
> tasks and targets, instead of the current shell way.  Mostly, I say
> that, though, b/c I am not much of a shell scripter and it is not a
> knock against the Hadoop implementation.
>
> We also should do some thinking on what we want to deprecate per  
> our plan to move to JDK 1.5 and Lucene 3.0.
>
> Thoughts?
>
> -Grant




Re: 2.3

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Grant Ingersoll" <gs...@apache.org> wrote:
> 
> On Aug 25, 2007, at 8:54 AM, Michael McCandless wrote:
> 
> >> Do we want to start thinking about a 2.3 release in the next month or
> >> so, say by the end of September?
> >
> > +1
> >
> > I'd like to commit LUCENE-845/847/870: I think those issues are not
> > too far from being committable.  I'll go mark them 2.3.
> >
> 
> So, I see these are fixed...
> 
> I see 3 issues open against 2.3 still, plus 1 that is open against 2.2

I think LUCENE-1007 should also be 2.3 (I'll mark it as such).

I'm also working on an issue with writers on different machines,
sharing an index over NFS.  It's a spinoff of this thread from
java-user:

  http://www.gossamer-threads.com/lists/engine?do=post_view_flat;post=50680;page=1;sb=post_latest_reply;so=ASC;mh=25;list=lucene

I haven't opened an issue yet but will soon; I think we should get
that fix into 2.3 as well.

After that I'll work on LUCENE-982.

> I think the biggest issue before releasing is our confidence in the  
> major changes related to DocumentsWriter and the merge stuff, right?

Yes :)  Specifically I'd like to see the debug output from Mark on
LUCENE-994 to figure out why he's seeing an unexpected performance loss
when merging segments by their net byte size.

Mike



Re: 2.3

Posted by Michael McCandless <lu...@mikemccandless.com>.
Yikes -- yes please do isolate the code & open an issue!

Mike

"Andi Vajda" <va...@osafoundation.org> wrote:
> 
> On Fri, 28 Sep 2007, Grant Ingersoll wrote:
> 
> > I see 3 issues open against 2.3 still, plus 1 that is open against 2.2
> >
> > I think the biggest issue before releasing is our confidence in the major 
> > changes related to DocumentsWriter and the merge stuff, right?
> >
> > Do we want to try to set a date for a release?
> 
> I found a bug with indexing documents that contain fields with Term
> Vectors. 
> The indexing fails with 'reading past EOF' errors in what seems to be the index
> optimizing phase during addIndexes(). (I index first into a RAMDirectory,
> then addIndexes() into an FSDirectory).
> 
> I have not filed the bug yet formally as I need to isolate the code. If I
> turn indexing with term vectors off, indexing completes fine.
> 
> Andi..
> 



Re: possible bug with indexing with term vectors

Posted by Andi Vajda <va...@osafoundation.org>.
On Sat, 29 Sep 2007, Andi Vajda wrote:

> I finally isolated the bug into a simple Java unit test. It indeed had to do 
> with doing multiple batches of document additions.

I forgot to say that I ran this with the most recent fixes for bug 1008.
To be precise, lucene svn rev 580605.

Andi..



Re: possible bug with indexing with term vectors

Posted by Andi Vajda <va...@osafoundation.org>.
On Sat, 29 Sep 2007, Andi Vajda wrote:

> Ok, this could explain why the test is passing. In the test I only do one 
> batch of indexing, not several like here. I missed that difference. My 
> apologies. I'm going to change my test now and report back...

I finally isolated the bug into a simple Java unit test. It indeed had to do 
with doing multiple batches of document additions.

Michael, I'm going to send you the test and the data privately as it contains 
data that probably shouldn't end up on JIRA, such as personal email snippets of
mine.

Andi..




Re: possible bug with indexing with term vectors

Posted by Andi Vajda <va...@osafoundation.org>.
On Sat, 29 Sep 2007, Michael McCandless wrote:

>> The new PyLucene is built with a code generator and all public APIs and
>> classes are made available to Python. SerialMergeScheduler is available.
>
> Wild!  Does this mean PyLucene will track tightly to Lucene releases
> going forward?

Yes, even more tightly than before since I don't have to patch the Lucene 
sources anymore.

> What happened prior to this first optimize call?  Did you just create
> the writer, switch to SerialMergeScheduler, add N docs, then call
> setInfoStream(...) and writer.optimize()?

Yes, that's almost exactly it. I create the writer fresh (with create=true),
then close it and its directory. Then I reopen it and add N docs.

> The debug output starts with an optimize() call, which first flushes
> 372 docs to segment _7f; this is the first segment in the index.  Had
> you opened this writer with create=true?

I open the writer with true when the app creates its initial repository.
Then the writer is added to and reopened without create=true.

> This optimize() does nothing because the index has only one segment
> (_7f) in compound file format, so it's already optimized.  Then the
> writer is closed.
>
> Then this is printed:
>
>  <DBRepositoryView: Lucene (1)> indexed 191 items in 0:00:00.413600
>
> Which is odd because 191 != 372.  Can't explain that difference...

That's because an item can have several attributes that get indexed, each 
becoming a Lucene document (an item is a Chandler object).

> Then another index writer is opened, 5 docs are added, then optimize()
> is called, which flushes 5 docs to segment _7g and converts it to
> compound file format.
>
> Finally we try to merge _7f and _7g for optimize, and we hit the EOF
> exception trying to read the term vector for a doc from one of these
> two segments.

Ok, this could explain why the test is passing. In the test I only do one 
batch of indexing, not several like here. I missed that difference. My 
apologies. I'm going to change my test now and report back...

Thank you for the explanations.

Andi..



Re: possible bug with indexing with term vectors

Posted by Andi Vajda <va...@osafoundation.org>.
On Fri, 28 Sep 2007, Michael McCandless wrote:

>> I tried all morning to isolate the problem but I seem to be unable
>> to reproduce it in a simple unit test. In my application, I've been
>> able to get errors by doing even less: just creating an FSDirectory
>> and adding documents with fields with term vectors fails when
>> optimizing the index with the error below. I even tried to add the
>> same documents, in the same order, in the unit test but to no
>> avail. It just works.
>
> Are you trying your unit test first in Python (using PyLucene)?

No, I wrote it in Java to begin with. But it's a good idea. I should try it 
from PyLucene too.

>> What is different about my environment? Well, I'm running PyLucene,
>> but the new one, the one using Apple's Java VM, the same VM I'm
>> using to run the unit test. And I'm not doing anything special like
>> calling back into Python or something, I'm just calling regular
>> Lucene APIs adding documents into an IndexWriter on an FSDirectory
>> using a StandardAnalyzer. If I stop using term vectors, all is
>> working fine.
>
> Are your documents irregular wrt term vectors?  (Ie some docs have
> none, others store the terms but not positions/offsets, etc?).  Any
> interesting changes to Lucene's defaults (autoCommit=false, etc)?

All default config values, no config changes. All documents follow the same 
pattern of having five fields, one with term vectors.

>> I'd like to get to the bottom of this but could use some help. Does
>> the stacktrace below ring a bell? Is there a way to run the whole
>> indexing and optimizing in one single thread?
>
> You can easily turn off the concurrent (background) merges by doing
> this:
>
>  writer.setMergeScheduler(new SerialMergeScheduler())
>
> though that probably isn't punched through to Python in PyLucene.  You
> can also build a Lucene JAR w/ a small change to IndexWriter.java to
> do the same thing.

The new PyLucene is built with a code generator and all public APIs and 
classes are made available to Python. SerialMergeScheduler is available.

> That stacktrace is happening while merging term vectors during an
> optimize.  It's specifically occurring when loading the term vectors
> for a given doc X; we read a position from the index stream (tvx) just
> fine, but then when we try to read the first vInt from the document
> stream (tvd) we hit the EOF exception.  So that position was too large
> or the tvd file was somehow truncated.  Weird.
>
> Can you call "writer.setInfoStream(System.out)" and get the error to
> occur and then post the resulting log?  It may shed some light
> here....

I called writer.setMergeScheduler(SerialMergeScheduler()) just after creating 
the writer and called writer.setInfoStream(System.out) just before calling 
optimize(). Below is what I get:

IW [Thread-0]:   flush: segment=_7f docStoreSegment=_7f docStoreOffset=0 
flushDocs=true flushDeletes=false flushDocStores=true numDocs=372 
numBufDelTerms=0
IW [Thread-0]:   index before flush

closeDocStore: 5 files to flush to segment _7f

flush postings as segment _7f numDocs=372
   oldRAMSize=75040 newFlushedSize=21018 docs/MB=18,558.867 new/old=28.009%
IW [Thread-0]: checkpoint: wrote segments file "segments_9j"
Deleter [Thread-0]: now checkpoint "segments_9j" [1 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.fnm": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.frq": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.prx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tis": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tii": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.nrm": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tvx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tvf": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tvd": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.fdx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.fdt": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9i"
Deleter [Thread-0]:   DecRef "segments_9i": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9i"
IW [Thread-0]: checkpoint: wrote segments file "segments_9k"
Deleter [Thread-0]: now checkpoint "segments_9k" [1 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9j"
Deleter [Thread-0]:   DecRef "_7f.fnm": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fnm"
Deleter [Thread-0]:   DecRef "_7f.frq": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.frq"
Deleter [Thread-0]:   DecRef "_7f.prx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.prx"
Deleter [Thread-0]:   DecRef "_7f.tis": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tis"
Deleter [Thread-0]:   DecRef "_7f.tii": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tii"
Deleter [Thread-0]:   DecRef "_7f.nrm": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.nrm"
Deleter [Thread-0]:   DecRef "_7f.tvx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvx"
Deleter [Thread-0]:   DecRef "_7f.tvf": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvf"
Deleter [Thread-0]:   DecRef "_7f.tvd": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvd"
Deleter [Thread-0]:   DecRef "_7f.fdx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fdx"
Deleter [Thread-0]:   DecRef "_7f.fdt": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fdt"
Deleter [Thread-0]:   DecRef "segments_9j": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9j"
IW [Thread-0]: optimize: index now _7f:c372
IW [Thread-0]:   flush: segment=null docStoreSegment=null docStoreOffset=0 
flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 
numBufDelTerms=0
IW [Thread-0]:   index before flush _7f:c372
IW [Thread-0]: at close: _7f:c372
<DBRepositoryView: Lucene (1)> indexed 191 items in 0:00:00.413600
IW [Thread-0]:   flush: segment=_7g docStoreSegment=_7g docStoreOffset=0 
flushDocs=true flushDeletes=false flushDocStores=true numDocs=5 
numBufDelTerms=0
IW [Thread-0]:   index before flush _7f:c372

closeDocStore: 5 files to flush to segment _7g

flush postings as segment _7g numDocs=5
   oldRAMSize=76960 newFlushedSize=437 docs/MB=11,997.437 new/old=0.568%
IW [Thread-0]: checkpoint: wrote segments file "segments_9l"
Deleter [Thread-0]: now checkpoint "segments_9l" [2 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]:   IncRef "_7g.fnm": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.frq": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.prx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tis": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tii": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.nrm": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tvx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tvf": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tvd": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.fdx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.fdt": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9k"
Deleter [Thread-0]:   DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "segments_9k": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9k"
IW [Thread-0]: checkpoint: wrote segments file "segments_9m"
Deleter [Thread-0]: now checkpoint "segments_9m" [2 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]:   IncRef "_7g.cfs": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9l"
Deleter [Thread-0]:   DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "_7g.fnm": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fnm"
Deleter [Thread-0]:   DecRef "_7g.frq": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.frq"
Deleter [Thread-0]:   DecRef "_7g.prx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.prx"
Deleter [Thread-0]:   DecRef "_7g.tis": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tis"
Deleter [Thread-0]:   DecRef "_7g.tii": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tii"
Deleter [Thread-0]:   DecRef "_7g.nrm": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.nrm"
Deleter [Thread-0]:   DecRef "_7g.tvx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvx"
Deleter [Thread-0]:   DecRef "_7g.tvf": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvf"
Deleter [Thread-0]:   DecRef "_7g.tvd": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvd"
Deleter [Thread-0]:   DecRef "_7g.fdx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fdx"
Deleter [Thread-0]:   DecRef "_7g.fdt": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fdt"
Deleter [Thread-0]:   DecRef "segments_9l": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9l"
IW [Thread-0]: optimize: index now _7f:c372 _7g:c5
IW [Thread-0]: add merge to pendingMerges: _7f:c372 _7g:c5 [optimize] [total 1 
pending]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]:   IncRef "_7g.cfs": pre-incr count is 1
IW [Thread-0]: now merge
   merge=_7f:c372 _7g:c5 into _7h [optimize]
   index=_7f:c372 _7g:c5
IW [Thread-0]: merging _7f:c372 _7g:c5 into _7h [optimize]
IW [Thread-0]: merge: total 377 docs
IW [Thread-0]: hit exception during merge; now refresh deleter on segment _7h
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.fdt"
Deleter [Thread-0]: delete "_7h.fdt"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.fdx"
Deleter [Thread-0]: delete "_7h.fdx"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.fnm"
Deleter [Thread-0]: delete "_7h.fnm"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.frq"
Deleter [Thread-0]: delete "_7h.frq"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.nrm"
Deleter [Thread-0]: delete "_7h.nrm"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.prx"
Deleter [Thread-0]: delete "_7h.prx"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tii"
Deleter [Thread-0]: delete "_7h.tii"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tis"
Deleter [Thread-0]: delete "_7h.tis"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tvd"
Deleter [Thread-0]: delete "_7h.tvd"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tvf"
Deleter [Thread-0]: delete "_7h.tvf"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tvx"
Deleter [Thread-0]: delete "_7h.tvx"
IW [Thread-0]: hit exception during merge
Deleter [Thread-0]:   DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "_7g.cfs": pre-decr count is 2
java.io.IOException: read past EOF
         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
         at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37)
         at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1668)
         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1604)
         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)

IW [Thread-0]:   flush: segment=null docStoreSegment=null docStoreOffset=0 
flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 
numBufDelTerms=0
IW [Thread-0]:   index before flush _7f:c372 _7g:c5
IW [Thread-0]: close: wrote segments file "segments_9n"
Deleter [Thread-0]: now checkpoint "segments_9n" [2 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]:   IncRef "_7g.cfs": pre-incr count is 1
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9m"
Deleter [Thread-0]:   DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "_7g.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "segments_9m": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9m"
IW [Thread-0]: at close: _7f:c372 _7g:c5

Andi..




Re: possible bug with indexing with term vectors

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Andi Vajda" <va...@osafoundation.org> wrote:
>
> I tried all morning to isolate the problem but I seem to be unable
> to reproduce it in a simple unit test. In my application, I've been
> able to get errors by doing even less: just creating an FSDirectory
> and adding documents with fields with term vectors fails when
> optimizing the index with the error below. I even tried to add the
> same documents, in the same order, in the unit test but to no
> avail. It just works.

Are you trying your unit test first in Python (using PyLucene)?

> What is different about my environment? Well, I'm running PyLucene,
> but the new one, the one using Apple's Java VM, the same VM I'm
> using to run the unit test. And I'm not doing anything special like
> calling back into Python or something, I'm just calling regular
> Lucene APIs adding documents into an IndexWriter on an FSDirectory
> using a StandardAnalyzer. If I stop using term vectors, all is
> working fine.

Spooky.  It's definitely possible something is broken (there is a lot
of new code in 2.3).

Are your documents irregular wrt term vectors?  (Ie some docs have
none, others store the terms but not positions/offsets, etc?).  Any
interesting changes to Lucene's defaults (autoCommit=false, etc)?

> I'd like to get to the bottom of this but could use some help. Does
> the stacktrace below ring a bell? Is there a way to run the whole
> indexing and optimizing in one single thread?

You can easily turn off the concurrent (background) merges by doing
this:

  writer.setMergeScheduler(new SerialMergeScheduler())

though that probably isn't punched through to Python in PyLucene.  You
can also build a Lucene JAR w/ a small change to IndexWriter.java to
do the same thing.
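
Eg, roughly this (just a sketch against the 2.3-dev trunk API; "dir" is
any Directory and the analyzer is only an example):

  IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
  writer.setMergeScheduler(new SerialMergeScheduler());  // merges now run in the calling thread
  // ... addDocument() calls ...
  writer.optimize();  // a merge failure now surfaces synchronously, in this thread
  writer.close();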

That stacktrace is happening while merging term vectors during an
optimize.  It's specifically occurring when loading the term vectors
for a given doc X; we read a position from the index stream (tvx) just
fine, but then when we try to read the first vInt from the document
stream (tvd) we hit the EOF exception.  So that position was too large
or the tvd file was somehow truncated.  Weird.
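
In rough pseudo-Java, the failing read path is something like this (a
simplified sketch, not the literal TermVectorsReader code; headerSize is
illustrative):

  tvx.seek(headerSize + docNum * 8L);   // index stream: one long per doc
  long tvdPosition = tvx.readLong();    // where this doc's entry starts in tvd
  tvd.seek(tvdPosition);
  int numVectorFields = tvd.readVInt(); // "read past EOF" here if the pointer
                                        // runs past the end of tvd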

Can you call "writer.setInfoStream(System.out)" and get the error to
occur and then post the resulting log?  It may shed some light
here....

Mike



Re: possible bug with indexing with term vectors

Posted by Grant Ingersoll <gr...@gmail.com>.
There are a couple of JIRA issues related to TVs as well, mostly edge  
cases, but Andi might want to take a look at them to see if they  
describe his situation.

-Grant

On Sep 29, 2007, at 8:35 AM, Michael McCandless wrote:

>
> You are right Grant -- good catch!!!  I have a unit test showing it
> now.  Thank you :)
>
> So, this case is tickled if you have a doc (or docs) that has some
> fields with term vectors enabled, but then later as part of the same
> buffered set of docs you have 1 or more docs that have no fields with
> term vectors enabled.
>
> I'll fix it.
>
> The thing is, from Andi's description I'm not sure this is the case
> he's hitting?  He said all docs have 5 fields, one of them with term
> vectors enabled ... hmmm.
>
> Mike
>
> On Sat, 29 Sep 2007 07:59:13 -0400, "Grant Ingersoll"  
> <gr...@gmail.com> said:
>> Hmmm, not sure, but in looking at DocumentsWriter, it seems like
>> lines around 553 might be at issue:
>> if (tvx != null) {
>>          tvx.writeLong(tvd.getFilePointer());
>>          if (numVectorFields > 0) {
>>            tvd.writeVInt(numVectorFields);
>>            for(int i=0;i<numVectorFields;i++)
>>              tvd.writeVInt(vectorFieldNumbers[i]);
>>            assert 0 == vectorFieldPointers[0];
>>            tvd.writeVLong(tvf.getFilePointer());
>>            long lastPos = vectorFieldPointers[0];
>>            for(int i=1;i<numVectorFields;i++) {
>>              long pos = vectorFieldPointers[i];
>>              tvd.writeVLong(pos-lastPos);
>>              lastPos = pos;
>>            }
>>            tvfLocal.writeTo(tvf);
>>            tvfLocal.reset();
>>          }
>>        }
>>
>> Specifically, the exception being thrown seems to be that it is
>> trying to read in a vInt that contains the number of fields that have
>> vectors.  However, in DocumentsWriter, it only writes out this vInt
>> if numVectorFields is > 0.
>>
>> I think you might try:
>> if (numVectorFields > 0) {
>>   ....
>> } else {
>>   tvd.writeVInt(0);
>> }
>>
>> In the old TermVectorsWriter, it used to be:
>>   private void writeDoc() throws IOException {
>>      if (isFieldOpen())
>>        throw new IllegalStateException("Field is still open while
>> writing document");
>>      //System.out.println("Writing doc pointer: " +  
>> currentDocPointer);
>>      // write document index record
>>      tvx.writeLong(currentDocPointer);
>>
>>      // write document data record
>>      final int size = fields.size();
>>
>>      // write the number of fields
>>      tvd.writeVInt(size);
>>
>>      // write field numbers
>>      for (int i = 0; i < size; i++) {
>>        TVField field = (TVField) fields.elementAt(i);
>>        tvd.writeVInt(field.number);
>>      }
>>
>> http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup
>>
>>
>>
>> On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:
>>
>>>
>>> On Fri, 28 Sep 2007, Andi Vajda wrote:
>>>
>>>> I found a bug with indexing documents that contain fields with
>>>> Term Vectors. The indexing fails with 'reading past EOF' errors in
>>>> what seems to be the index optimizing phase during addIndexes(). (I
>>>> index first into a RAMDirectory, then addIndexes() into an
>>>> FSDirectory).
>>>>
>>>> I have not filed the bug yet formally as I need to isolate the
>>>> code. If I turn indexing with term vectors off, indexing completes
>>>> fine.
>>>
>>> I tried all morning to isolate the problem but I seem to be unable
>>> to reproduce it in a simple unit test. In my application, I've been
>>> able to get errors by doing even less: just creating an FSDirectory
>>> and adding documents with fields with term vectors fails when
>>> optimizing the index with the error below. I even tried to add the
>>> same documents, in the same order, in the unit test but to no
>>> avail. It just works.
>>>
>>> What is different about my environment? Well, I'm running
>>> PyLucene, but the new one, the one using Apple's Java VM, the
>>> same VM I'm using to run the unit test. And I'm not doing anything
>>> special like calling back into Python or something, I'm just
>>> calling regular Lucene APIs adding documents into an IndexWriter on
>>> an FSDirectory using a StandardAnalyzer. If I stop using term
>>> vectors, all is working fine.
>>>
>>> I'd like to get to the bottom of this but could use some help. Does
>>> the stacktrace below ring a bell? Is there a way to run the whole
>>> indexing and optimizing in one single thread?
>>>
>>> Thanks!
>>>
>>> Andi..
>>>
>>> Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
>>> Caused by: java.io.IOException: read past EOF
>>>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>>>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>>>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>>>         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>>>         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>>>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
>>> java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
>>>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
>>>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
>>> Caused by: java.io.IOException: read past EOF
>>>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>>>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>>>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>>>         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>>>         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>>>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
>>>
>>
>> ------------------------------------------------------
>> Grant Ingersoll
>> http://www.grantingersoll.com/
>> http://lucene.grantingersoll.com
>> http://www.paperoftheweek.com/
>>
>>
>>
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





Re: possible bug with indexing with term vectors

Posted by Michael McCandless <lu...@mikemccandless.com>.
You are right Grant -- good catch!!!  I have a unit test showing it
now.  Thank you :)

So, this case is tickled if you have a doc (or docs) that has some
fields with term vectors enabled, but then later, as part of the same
buffered set of docs, you have 1 or more docs that have no fields with
term vectors enabled.
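
Roughly like this, as the body of a test method (a sketch against the
2.3-dev trunk API -- not the actual unit test):

  Directory dir = new RAMDirectory();
  IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

  Document withVectors = new Document();
  withVectors.add(new Field("f", "some text", Field.Store.NO,
      Field.Index.TOKENIZED, Field.TermVector.YES));
  writer.addDocument(withVectors);

  Document noVectors = new Document();
  noVectors.add(new Field("f", "more text", Field.Store.NO,
      Field.Index.TOKENIZED, Field.TermVector.NO));
  writer.addDocument(noVectors);  // same buffered batch, no term-vector fields
  writer.close();                 // flush writes a tvd entry only for the first doc

  writer = new IndexWriter(dir, new StandardAnalyzer(), false);
  writer.addDocument(withVectors);
  writer.optimize();              // merging term vectors hits "read past EOF"
  writer.close();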

I'll fix it.

The thing is, from Andi's description I'm not sure this is the case
he's hitting?  He said all docs have 5 fields, one of them with term
vectors enabled ... hmmm.

Mike

On Sat, 29 Sep 2007 07:59:13 -0400, "Grant Ingersoll" <gr...@gmail.com> said:
> Hmmm, not sure, but in looking at DocumentsWriter, it seems like  
> lines around 553 might be at issue:
> if (tvx != null) {
>          tvx.writeLong(tvd.getFilePointer());
>          if (numVectorFields > 0) {
>            tvd.writeVInt(numVectorFields);
>            for(int i=0;i<numVectorFields;i++)
>              tvd.writeVInt(vectorFieldNumbers[i]);
>            assert 0 == vectorFieldPointers[0];
>            tvd.writeVLong(tvf.getFilePointer());
>            long lastPos = vectorFieldPointers[0];
>            for(int i=1;i<numVectorFields;i++) {
>              long pos = vectorFieldPointers[i];
>              tvd.writeVLong(pos-lastPos);
>              lastPos = pos;
>            }
>            tvfLocal.writeTo(tvf);
>            tvfLocal.reset();
>          }
>        }
> 
> Specifically, the exception being thrown seems to be that it is  
> trying to read in a vInt that contains the number of fields that have  
> vectors.  However, in DocumentsWriter, it only writes out this vInt  
> if numVectorFields is > 0.
> 
> I think you might try:
> if (numVectorFields > 0) {
>   ....
> } else {
>   tvd.writeVInt(0);
> }
> 
> In the old TermVectorsWriter, it used to be:
>   private void writeDoc() throws IOException {
>      if (isFieldOpen())
>        throw new IllegalStateException("Field is still open while  
> writing document");
>      //System.out.println("Writing doc pointer: " + currentDocPointer);
>      // write document index record
>      tvx.writeLong(currentDocPointer);
> 
>      // write document data record
>      final int size = fields.size();
> 
>      // write the number of fields
>      tvd.writeVInt(size);
> 
>      // write field numbers
>      for (int i = 0; i < size; i++) {
>        TVField field = (TVField) fields.elementAt(i);
>        tvd.writeVInt(field.number);
>      }
> 
> http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup
> 
> 
> 
> On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:
> 
> >
> > On Fri, 28 Sep 2007, Andi Vajda wrote:
> >
> >> I found a bug with indexing documents that contain fields with  
> >> Term Vectors. The indexing fails with 'reading past EOF' errors in  
> >> what seems to be the index optimizing phase during addIndexes(). (I
> >> index first into a RAMDirectory, then addIndexes() into an
> >> FSDirectory).
> >>
> >> I have not filed the bug yet formally as I need to isolate the  
> >> code. If I turn indexing with term vectors off, indexing completes  
> >> fine.
> >
> > I tried all morning to isolate the problem but I seem to be unable  
> > to reproduce it in a simple unit test. In my application, I've been  
> > able to get errors by doing even less: just creating an FSDirectory
> > and adding documents with fields with term vectors fails when  
> > optimizing the index with the error below. I even tried to add the  
> > same documents, in the same order, in the unit test but to no  
> > avail. It just works.
> >
> > What is different about my environment? Well, I'm running
> > PyLucene, but the new one, the one using Apple's Java VM, the
> > same VM I'm using to run the unit test. And I'm not doing anything  
> > special like calling back into Python or something, I'm just  
> > calling regular Lucene APIs adding documents into an IndexWriter on  
> > an FSDirectory using a StandardAnalyzer. If I stop using term  
> > vectors, all is working fine.
> >
> > I'd like to get to the bottom of this but could use some help. Does  
> > the stacktrace below ring a bell? Is there a way to run the whole
> > indexing and optimizing in one single thread?
> >
> > Thanks!
> >
> > Andi..
> >
> > Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
> >         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
> > Caused by: java.io.IOException: read past EOF
> >         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
> >         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> >         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> >         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
> >         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
> >         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
> >         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
> >         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
> >         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
> >         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
> > java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
> >         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
> >         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
> > Caused by: java.io.IOException: read past EOF
> >         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
> >         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
> >         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
> >         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
> >         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
> >         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
> >         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
> >         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
> >         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
> >         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
> >
> 
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
> 
> 
> 



Re: possible bug with indexing with term vectors

Posted by Grant Ingersoll <gr...@gmail.com>.
Hmmm, not sure, but in looking at DocumentsWriter, it seems like  
lines around 553 might be at issue:
if (tvx != null) {
  tvx.writeLong(tvd.getFilePointer());
  if (numVectorFields > 0) {
    tvd.writeVInt(numVectorFields);
    for(int i=0;i<numVectorFields;i++)
      tvd.writeVInt(vectorFieldNumbers[i]);
    assert 0 == vectorFieldPointers[0];
    tvd.writeVLong(tvf.getFilePointer());
    long lastPos = vectorFieldPointers[0];
    for(int i=1;i<numVectorFields;i++) {
      long pos = vectorFieldPointers[i];
      tvd.writeVLong(pos-lastPos);
      lastPos = pos;
    }
    tvfLocal.writeTo(tvf);
    tvfLocal.reset();
  }
}

Specifically, the exception being thrown seems to be that it is  
trying to read in a vInt that contains the number of fields that have  
vectors.  However, in DocumentsWriter, it only writes out this vInt  
if numVectorFields is > 0.

I think you might try:
if (numVectorFields > 0) {
  ....
} else {
  tvd.writeVInt(0);
}

In the old TermVectorsWriter, it used to be:
  private void writeDoc() throws IOException {
     if (isFieldOpen())
       throw new IllegalStateException("Field is still open while  
writing document");
     //System.out.println("Writing doc pointer: " + currentDocPointer);
     // write document index record
     tvx.writeLong(currentDocPointer);

     // write document data record
     final int size = fields.size();

     // write the number of fields
     tvd.writeVInt(size);

     // write field numbers
     for (int i = 0; i < size; i++) {
       TVField field = (TVField) fields.elementAt(i);
       tvd.writeVInt(field.number);
     }

http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup



On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:

>
> On Fri, 28 Sep 2007, Andi Vajda wrote:
>
>> I found a bug with indexing documents that contain fields with  
>> Term Vectors. The indexing fails with 'reading past EOF' errors in  
>> what seems to be the index optimizing phase during addIndexes(). (I
>> index first into a RAMDirectory, then addIndexes() into an
>> FSDirectory).
>>
>> I have not filed the bug yet formally as I need to isolate the  
>> code. If I turn indexing with term vectors off, indexing completes  
>> fine.
>
> I tried all morning to isolate the problem but I seem to be unable  
> to reproduce it in a simple unit test. In my application, I've been  
> able to get errors by doing even less: just creating an FSDirectory
> and adding documents with fields with term vectors fails when  
> optimizing the index with the error below. I even tried to add the  
> same documents, in the same order, in the unit test but to no  
> avail. It just works.
>
> What is different about my environment? Well, I'm running
> PyLucene, but the new one, the one using Apple's Java VM, the
> same VM I'm using to run the unit test. And I'm not doing anything  
> special like calling back into Python or something, I'm just  
> calling regular Lucene APIs adding documents into an IndexWriter on  
> an FSDirectory using a StandardAnalyzer. If I stop using term  
> vectors, all is working fine.
>
> I'd like to get to the bottom of this but could use some help. Does  
> the stacktrace below ring a bell? Is there a way to run the whole
> indexing and optimizing in one single thread?
>
> Thanks!
>
> Andi..
>
> Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
> Caused by: java.io.IOException: read past EOF
>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
> java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
> Caused by: java.io.IOException: read past EOF
>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





possible bug with indexing with term vectors

Posted by Andi Vajda <va...@osafoundation.org>.
On Fri, 28 Sep 2007, Andi Vajda wrote:

> I found a bug with indexing documents that contain fields with Term Vectors. 
> The indexing fails with 'reading past EOF' errors in what seems to be the index
> optimizing phase during addIndexes(). (I index first into a RAMDirectory,
> then addIndexes() into an FSDirectory).
>
> I have not filed the bug yet formally as I need to isolate the code. If I 
> turn indexing with term vectors off, indexing completes fine.

I tried all morning to isolate the problem but I seem to be unable to
reproduce it in a simple unit test. In my application, I've been able to get
errors by doing even less: just creating an FSDirectory and adding documents
with fields with term vectors fails when optimizing the index with the error
below. I even tried to add the same documents, in the same order, in the unit
test but to no avail. It just works.

What is different about my environment? Well, I'm running PyLucene, but the
new one, the one using Apple's Java VM, the same VM I'm using to run the
unit test. And I'm not doing anything special like calling back into Python or
something, I'm just calling regular Lucene APIs adding documents into an
IndexWriter on an FSDirectory using a StandardAnalyzer. If I stop using term
vectors, all is working fine.

I'd like to get to the bottom of this but could use some help. Does the
stacktrace below ring a bell? Is there a way to run the whole indexing and
optimizing in one single thread?

Thanks!

Andi..

Exception in thread "Thread-4" 
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read 
past EOF
         at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
Caused by: java.io.IOException: read past EOF
         at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
         at 
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
         at 
org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
         at 
org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
         at 
org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
         at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
         at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w 
[optimize]
         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
Caused by: java.io.IOException: read past EOF
         at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
         at 
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
         at 
org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
         at 
org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
         at 
org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
         at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
         at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)



Re: 2.3

Posted by Andi Vajda <va...@osafoundation.org>.
On Fri, 28 Sep 2007, Grant Ingersoll wrote:

> I see 3 issues open against 2.3 still, plus 1 that is open against 2.2
>
> I think the biggest issue before releasing is our confidence in the major 
> changes related to DocumentsWriter and the merge stuff, right?
>
> Do we want to try to set a date for a release?

I found a bug with indexing documents that contain fields with Term Vectors. 
The indexing fails with 'reading past EOF' errors in what seems to be the index
optimizing phase during addIndexes(). (I index first into a RAMDirectory, then
addIndexes() into an FSDirectory).

I have not filed the bug yet formally as I need to isolate the code. If I turn 
indexing with term vectors off, indexing completes fine.

Andi..



Re: 2.3

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 25, 2007, at 8:54 AM, Michael McCandless wrote:

>> Do we want to start thinking about a 2.3 release in the next month or
>> so, say by the end of September?
>
> +1
>
> I'd like to commit LUCENE-845/847/870: I think those issues are not
> too far from being committable.  I'll go mark them 2.3.
>

So, I see these are fixed...

I see 3 issues open against 2.3 still, plus 1 that is open against 2.2

I think the biggest issue before releasing is our confidence in the  
major changes related to DocumentsWriter and the merge stuff, right?

Do we want to try to set a date for a release?


>> I propose this time around, during our freeze cycle, we take a good
>> look at cleaning up JIRA, especially some of the older issues (<
>> LUCENE-400).
>
> Great!
>
>> I am hoping to add a Hudson call similar to what Hadoop does that
>> automatically applies patches to the trunk to see if the tests pass,
>> codestyle is met, etc.  If anyone wants to help out with that, I
>> would appreciate it.  Ideally, we could build something that is nice
>> and portable so that other Lucene projects can use it (maybe even
>> other ASF projects).  It seems like it could be done through some ANT
>> tasks and targets, instead of the current shell way.  Mostly, I say
>> that, though, b/c I am not much of a shell scripter and it is not a
>> knock against the Hadoop implementation.
>
> This would be awesome.
>
> Mike
>





Re: 2.3

Posted by Michael McCandless <lu...@mikemccandless.com>.
> Do we want to start thinking about a 2.3 release in the next month or
> so, say by the end of September?

+1

I'd like to commit LUCENE-845/847/870: I think those issues are not
too far from being committable.  I'll go mark them 2.3.

> I propose this time around, during our freeze cycle, we take a good
> look at cleaning up JIRA, especially some of the older issues (<
> LUCENE-400).

Great!

> I am hoping to add a Hudson call similar to what Hadoop does that
> automatically applies patches to the trunk to see if the tests pass,
> codestyle is met, etc.  If anyone wants to help out with that, I
> would appreciate it.  Ideally, we could build something that is nice
> and portable so that other Lucene projects can use it (maybe even
> other ASF projects).  It seems like it could be done through some ANT
> tasks and targets, instead of the current shell way.  Mostly, I say
> that, though, b/c I am not much of a shell scripter and it is not a
> knock against the Hadoop implementation.

This would be awesome.

Mike
