Posted to java-user@lucene.apache.org by Pa...@emainc.com on 2009/08/31 16:28:16 UTC

Indexing large files?

Hi,

 

I'm working with Lucene 2.4.0 on the JVM (JDK 1.6.0_07).  I'm
consistently receiving "OutOfMemoryError: Java heap space" when trying
to index large text files.

 

Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
max. heap size.  So I increased the max. heap size to 512 MB.  This
worked for the 5 MB text file, but Lucene still used 84 MB of heap space
to do this.  Why so much?

 

The class FreqProxTermsWriterPerField appears to be the biggest memory
consumer by far according to JConsole and the TPTP Memory Profiling
plugin for Eclipse Ganymede.

 

Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
max. heap size.  Increasing the max. heap size to 1024 MB works, but
Lucene uses 826 MB of heap space while doing it.  That still seems like
far too much memory for the task, and I'm sure larger files would hit
the error again, since memory use appears to grow with file size.

 

I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
practice for indexing large files?  Here is a code snippet that I'm
using:

 

// Index the content of a text file.
private Boolean saveTXTFile(File textFile, Document textDocument)
            throws CIDBException {

      try {

            Boolean isFile = textFile.isFile();
            Boolean hasTextExtension = textFile.getName().endsWith(".txt");

            if (isFile && hasTextExtension) {

                  System.out.println("File " + textFile.getCanonicalPath()
                              + " is being indexed");

                  Reader textFileReader = new FileReader(textFile);

                  if (textDocument == null)
                        textDocument = new Document();

                  textDocument.add(new Field("content", textFileReader));

                  indexWriter.addDocument(textDocument);   // BREAKS HERE!!!!
            }

      } catch (FileNotFoundException fnfe) {
            System.out.println(fnfe.getMessage());
            return false;
      } catch (CorruptIndexException cie) {
            throw new CIDBException("The index has become corrupt.");
      } catch (IOException ioe) {
            System.out.println(ioe.getMessage());
            return false;
      }

      return true;
}

 

 

Thanks much,

 

Paul

 


Re: Stopping a runaway search, any ideas?

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Wow, that's exactly what I was looking for! In the meantime I'll use the
time-based collector.

Thanks Uwe and Mark for your help!
Daniel Shane

mark harwood wrote:
> Or https://issues.apache.org/jira/browse/LUCENE-1720 offers lightweight timeout testing at all index access stages prior to calls to Collector e.g. will catch a runaway fuzzy query during it's expensive term expansion phase.
>
>
>
> ----- Original Message ----
> From: Uwe Schindler <uw...@thetaphi.de>
> To: java-user@lucene.apache.org
> Sent: Friday, 11 September, 2009 15:33:19
> Subject: RE: Stopping a runaway search, any ideas?
>
> Yes: TimeLimitedCollector in 2.4.1 (and the new non-deprecated ones in 2.9).
> Just wrap your own collector (like TopDocsCollector) with this class.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>   
>> -----Original Message-----
>> From: Daniel Shane [mailto:shaned@LEXUM.UMontreal.CA]
>> Sent: Friday, September 11, 2009 4:26 PM
>> To: java-user@lucene.apache.org
>> Subject: Stopping a runaway search, any ideas?
>>
>> I don't think its possible, but is there something in lucene to cap a
>> search to a predefined time length or is there a way to stop a search
>> when its running for too long?
>>
>> Daniel Shane
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>     
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>       
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>   


Re: Stopping a runaway search, any ideas?

Posted by mark harwood <ma...@yahoo.co.uk>.
Or https://issues.apache.org/jira/browse/LUCENE-1720 offers lightweight timeout testing at all index access stages prior to calls to the Collector, e.g. it will catch a runaway fuzzy query during its expensive term expansion phase.



----- Original Message ----
From: Uwe Schindler <uw...@thetaphi.de>
To: java-user@lucene.apache.org
Sent: Friday, 11 September, 2009 15:33:19
Subject: RE: Stopping a runaway search, any ideas?

Yes: TimeLimitedCollector in 2.4.1 (and the new non-deprecated ones in 2.9).
Just wrap your own collector (like TopDocsCollector) with this class.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Daniel Shane [mailto:shaned@LEXUM.UMontreal.CA]
> Sent: Friday, September 11, 2009 4:26 PM
> To: java-user@lucene.apache.org
> Subject: Stopping a runaway search, any ideas?
> 
> I don't think its possible, but is there something in lucene to cap a
> search to a predefined time length or is there a way to stop a search
> when its running for too long?
> 
> Daniel Shane
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Stopping a runaway search, any ideas?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Yes: TimeLimitedCollector in 2.4.1 (and the new non-deprecated ones in 2.9).
Just wrap your own collector (like TopDocsCollector) with this class.
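
For reference, a minimal sketch of that wrapping against the 2.4.x API (class
and constructor names recalled from the 2.4 javadocs, so double-check them
against your version; the 10-hit cap and the helper class are just illustrative):

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TimeLimitedCollector;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.TopDocs;

public class TimedSearch {

    // Runs the query, but gives up after timeAllowedMillis and returns
    // whatever hits were collected before the time budget ran out.
    public static TopDocs search(IndexSearcher searcher, Query query,
                                 long timeAllowedMillis) throws IOException {
        TopDocCollector topDocs = new TopDocCollector(10);
        TimeLimitedCollector limited =
                new TimeLimitedCollector(topDocs, timeAllowedMillis);
        try {
            searcher.search(query, limited);
        } catch (TimeLimitedCollector.TimeExceededException tee) {
            // Time ran out; topDocs still holds the partial results.
            System.out.println("Search timed out: " + tee.getMessage());
        }
        return topDocs.topDocs();
    }
}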

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Daniel Shane [mailto:shaned@LEXUM.UMontreal.CA]
> Sent: Friday, September 11, 2009 4:26 PM
> To: java-user@lucene.apache.org
> Subject: Stopping a runaway search, any ideas?
> 
> I don't think its possible, but is there something in lucene to cap a
> search to a predefined time length or is there a way to stop a search
> when its running for too long?
> 
> Daniel Shane
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Stopping a runaway search, any ideas?

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Stopping a runaway search, any ideas?
: References: <5B...@sc1exc2.corp.emainc.com>
:      <5B...@sc1exc2.corp.emainc.com>	
:     <24098ED350C76D46A4FDBD81B51BE0E903F040B80F@exchange.windows.mmu.acquireme
:     dia.com> <5e...@mail.gmail.com>
: In-Reply-To: <5e...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Stopping a runaway search, any ideas?

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
I don't think it's possible, but is there something in Lucene to cap a
search to a predefined time length, or is there a way to stop a search
when it's running for too long?

Daniel Shane


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing large files? - No answers yet...

Posted by Brian Pinkerton <br...@lucidimagination.com>.
Quite possibly, but shouldn't one expect Lucene's resource usage to track
the size of the problem in question?  Paul's two examples below use
input files of 5 and 62 MB, hardly the size of input I'd expect to
handle in a memory-compromised environment.

bri

On Sep 11, 2009, at 7:43 AM, Glen Newton wrote:

> Paul,
>
> I saw your last post and now understand the issues you face.
>
> I don't think there has been any effort to produce a
> reduced-memory-footprint configurable (RMFC) Lucene. With the many
> mobile devices, embedded and other reduced memory devices, should this
> perhaps be one of the areas the Lucene community looks in to?
>
> -Glen
>
> 2009/9/11  <Pa...@emainc.com>:
>> Thanks Glen!
>>
>> I will take at your project.  Unfortunately I will only have 512 MB  
>> to 1024 MB to work with as Lucene is only one component in a larger  
>> software system running on one machine.  I agree with you on the C\C 
>> ++ comment.  That is what I would normally use for memory intense  
>> software.  It turns out that the larger file you want to index is  
>> the larger the heap space you will need.  What I would like to see  
>> is a way to "throttle" the indexing process to control the memory  
>> footprint.  I understand that this will take longer, but if I  
>> perform the task during off hours it shouldn't matter. At least the  
>> file will be indexed correctly.
>>
>> Thanks,
>> Paul
>>
>>
>> -----Original Message-----
>> From: java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org 
>>  [mailto:java-user-return-42272- 
>> Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
>> Sent: Friday, September 11, 2009 9:53 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Indexing large files? - No answers yet...
>>
>> In this project:
>> http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
>>
>> I concatenate all the text of all of articles of a single journal  
>> into
>> a single text file.
>> This can create a text file that is 500MB in size.
>> Lucene is OK in indexing files this size (in parallel even), but I
>> have a heap size of 8GB.
>>
>> I would suggest increasing your heap to as large as your machine can
>> reasonably take.
>> The reality is that Java programs (like Lucene) take up more memory
>> than a similar C or even C++ program.
>> Java may approach C/C++ in speed, but not memory.
>>
>> We don't use Java because of its memory footprint!  ;-)
>>
>> See:
>> Programming language shootout: speed:
>> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>> Programming language shootout: memory:
>> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>>
>> -glen
>>
>> 2009/9/11 Dan OConnor <do...@acquiremedia.com>:
>>> Paul:
>>>
>>> My first suggestion would be to update your JVM to the latest  
>>> version (or at least .14). There were several garbage collection  
>>> related issues resolved in version 10 - 13 (especially dealing  
>>> with large heaps).
>>>
>>> Next, your IndexWriter parameters would help figure out why you  
>>> are using so much RAM
>>>       getMaxFieldLength()
>>>       getMaxBufferedDocs()
>>>       getMaxMergeDocs()
>>>       getRAMBufferSizeMB()
>>>
>>> How often are you calling commit?
>>> Do you close your IndexWriter after every document?
>>> How many documents of this size are you indexing?
>>> Have you used luke to look at your index?
>>> If this is a large index, have you optimized it recently?
>>> Are there any searches going on while you are indexing?
>>>
>>>
>>> Regards,
>>> Dan
>>>
>>>
>>> -----Original Message-----
>>> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
>>> Sent: Friday, September 11, 2009 7:57 AM
>>> To: java-user@lucene.apache.org
>>> Subject: RE: Indexing large files? - No answers yet...
>>>
>>> This issue is still open.  Any suggestions/help with this would be
>>> greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>> -----Original Message-----
>>> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>>> [mailto:java-user-return-42080- 
>>> Paul_Murdoch=emainc.com@lucene.apache.org
>>> ] On Behalf Of Paul_Murdoch@emainc.com
>>> Sent: Monday, August 31, 2009 10:28 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Indexing large files?
>>>
>>> Hi,
>>>
>>>
>>>
>>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
>>> consistently receiving "OutOfMemoryError: Java heap space", when  
>>> trying
>>> to index large text files.
>>>
>>>
>>>
>>> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
>>> max. heap size.  So I increased the max. heap size to 512 MB.  This
>>> worked for the 5 MB text file, but Lucene still used 84 MB of heap  
>>> space
>>> to do this.  Why so much?
>>>
>>>
>>>
>>> The class FreqProxTermsWriterPerField appears to be the biggest  
>>> memory
>>> consumer by far according to JConsole and the TPTP Memory Profiling
>>> plugin for Eclipse Ganymede.
>>>
>>>
>>>
>>> Example 2: Indexing a 62 MB text file runs out of memory with a  
>>> 512 MB
>>> max. heap size.  Increasing the max. heap size to 1024 MB works but
>>> Lucene uses 826 MB of heap space while performing this.  Still seems
>>> like way too much memory is being used to do this.  I'm sure larger
>>> files would cause the error as it seems correlative.
>>>
>>>
>>>
>>> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the  
>>> best
>>> practice for indexing large files?  Here is a code snippet that I'm
>>> using:
>>>
>>>
>>>
>>> // Index the content of a text file.
>>>
>>>     private Boolean saveTXTFile(File textFile, Document  
>>> textDocument)
>>> throws CIDBException {
>>>
>>>
>>>
>>>           try {
>>>
>>>
>>>
>>>                 Boolean isFile = textFile.isFile();
>>>
>>>                 Boolean hasTextExtension =
>>> textFile.getName().endsWith(".txt");
>>>
>>>
>>>
>>>                 if (isFile && hasTextExtension) {
>>>
>>>
>>>
>>>                       System.out.println("File " +
>>> textFile.getCanonicalPath() + " is being indexed");
>>>
>>>                       Reader textFileReader = new
>>> FileReader(textFile);
>>>
>>>                       if (textDocument == null)
>>>
>>>                             textDocument = new Document();
>>>
>>>                       textDocument.add(new Field("content",
>>> textFileReader));
>>>
>>>                       indexWriter.addDocument(textDocument);
>>> // BREAKS HERE!!!!
>>>
>>>                 }
>>>
>>>           } catch (FileNotFoundException fnfe) {
>>>
>>>                 System.out.println(fnfe.getMessage());
>>>
>>>                 return false;
>>>
>>>           } catch (CorruptIndexException cie) {
>>>
>>>                 throw new CIDBException("The index has become
>>> corrupt.");
>>>
>>>           } catch (IOException ioe) {
>>>
>>>                 System.out.println(ioe.getMessage());
>>>
>>>                 return false;
>>>
>>>           }
>>>
>>>           return true;
>>>
>>>     }
>>>
>>>
>>>
>>>
>>>
>>> Thanks much,
>>>
>>>
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>>
>> --
>>
>> -
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> -- 
>
> -
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing large files? - No answers yet...

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Sep 11, 2009 at 1:15 PM,  <Pa...@emainc.com> wrote:

> I've been testing out "paging" the document this past week.  I'm
> still working on getting a successful test and think I'm close.  The
> down side was a drastic slow down in indexing speed, and lots of
> open files, but that was expected.

You mean a slowdown in indexing speed because you now flush after
every page, not after every document, right?  That's expected.

But I'm not sure why you'd see a change in the number of open files...

> I tried with small mergeFactors, maxBufferedDocs(haven't tried 1
> though), and ramBufferSizeMB.  Using JConsole to monitor the heap
> usage, this method slowly creeps towards my max heap space until
> OOM. I can say that at least some of the document gets indexed
> before OOM.  So I performed a heap dump at OOM and saw that
> FreqProxTermsWriterPerField had by far consumed the most memory.  I
> haven't looked into that yet...

It's at least ~60 bytes per unique term, not counting the char[] to
hold the term, and FreqProxTermsWriterPerField is exactly where most
of those bytes are allocated (eg its PostingList class).
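
As a rough back-of-the-envelope check (assuming ~8-character whitespace-separated
tokens, which is just a guess): a 62 MB file would hold roughly 62,000,000 / 8,
i.e. close to 8 million tokens, and if most of those are unique then
8 million x ~60 bytes is already around 480 MB before counting the term text
itself (2 bytes per char in Java), which is the same order of magnitude as the
826 MB Paul observed.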

> Let's say I page the document into ten different smaller documents
> and they are indexed successfully (I'm not quite at this point yet).
> Is there a way to select documents by id and merge them into one
> large document after they are in the index?  That was my plan to
> work around OOM and achieve the same end result as trying to index
> the large document in one shot.

You mean at search time right?  You basically want the equivalent of
SQL's "group by".

You could make a custom Collector that does this...  Or look at how
Solr is iterating on field collapsing
(https://issues.apache.org/jira/browse/SOLR-236)?
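
A rough sketch of such a collector against the 2.4 HitCollector API, assuming a
hypothetical stored "fileId" field shared by all the pages split from one
original file (loading stored fields for every hit is slow, so treat this purely
as an illustration of the idea, not a drop-in solution):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.HitCollector;

public class GroupByFileIdCollector extends HitCollector {

    private final IndexReader reader;
    // Best score seen so far per fileId (effectively a "group by fileId").
    private final Map<String, Float> bestScorePerFile = new HashMap<String, Float>();

    public GroupByFileIdCollector(IndexReader reader) {
        this.reader = reader;
    }

    public void collect(int doc, float score) {
        try {
            Document d = reader.document(doc);   // loads stored fields (slow per hit)
            String fileId = d.get("fileId");     // hypothetical grouping field
            Float best = bestScorePerFile.get(fileId);
            if (best == null || score > best.floatValue()) {
                bestScorePerFile.put(fileId, Float.valueOf(score));
            }
        } catch (IOException ioe) {
            // collect() cannot throw checked exceptions, so wrap it.
            throw new RuntimeException(ioe);
        }
    }

    public Map<String, Float> getBestScores() {
        return bestScorePerFile;
    }
}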

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Indexing large files? - No answers yet...

Posted by Pa...@emainc.com.
Thanks Mike!

I've been testing out "paging" the document this past week.  I'm still working on getting a successful test and think I'm close.  The downside was a drastic slowdown in indexing speed, and lots of open files, but that was expected.  I tried with small mergeFactors, maxBufferedDocs (haven't tried 1 though), and ramBufferSizeMB.  Using JConsole to monitor the heap usage, this method slowly creeps towards my max heap space until OOM. I can say that at least some of the document gets indexed before OOM.
So I performed a heap dump at OOM and saw that FreqProxTermsWriterPerField had by far consumed the most memory.  I haven't looked into that yet...  

Let's say I page the document into ten different smaller documents and they are indexed successfully (I'm not quite at this point yet).  Is there a way to select documents by id and merge them into one large document after they are in the index?  That was my plan to work around OOM and achieve the same end result as trying to index the large document in one shot.
  
Paul


-----Original Message-----
From: java-user-return-42283-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42283-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Michael McCandless
Sent: Friday, September 11, 2009 11:54 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing large files? - No answers yet...

To minimize Lucene's RAM usage during indexing, you should flush after
every document, eg by setting the ramBufferSizeMB to something tiny
(or maxBufferedDocs to 1).

But, unfortunately, Lucene cannot flush partway through indexing one
document.  Ie, the full document must be indexed into RAM before being
flushed.  So the worst case for Lucene will always be a single large
document.

Worse, documents with an unusually high number of unique terms will
then consume even more memory, because there is a certain RAM cost for
each unique term that's seen.

So the absolute worst case is a single large document, all of whose
terms are unique, which seems to be what's being tested here.

In theory one could make a custom indexing chain that knows it will
only hold a single document in ram at once, and could therefore trim
some of the data that we now must store per term, or maybe reduce the
size of the data types (we now use int for most fields per term, but
you could reduce them to shorts and force a flush whenever the shorts
might overflow), etc.

One possible workaround would be to pre-divide such large documents,
before indexing them, though this'd require coalescing at search time.

Mike

On Fri, Sep 11, 2009 at 11:02 AM,  <Pa...@emainc.com> wrote:
> Glen,
>
> Absolutely. I think a RMFC Lucene would great, especially for reduced memory or low bandwidth client/server scenarios.
>
> I just looked at your LuSql tool and it just what I needed about 9 months ago :-).  I wrote a simple re-indexer that interfaces to an SQL Server 2005 database and Lucene, but I could have saved some time if I knew about LuSql.  Unfortunately we're too far down the road in development to test and possibly integrate it into our system now, but I will put it on the R&D list for the next iteration.
>
> Thanks again,
>
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42277-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42277-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
> Sent: Friday, September 11, 2009 10:44 AM
> To: java-user@lucene.apache.org
> Subject: Re: Indexing large files? - No answers yet...
>
> Paul,
>
> I saw your last post and now understand the issues you face.
>
> I don't think there has been any effort to produce a
> reduced-memory-footprint configurable (RMFC) Lucene. With the many
> mobile devices, embedded and other reduced memory devices, should this
> perhaps be one of the areas the Lucene community looks in to?
>
> -Glen
>
> 2009/9/11  <Pa...@emainc.com>:
>> Thanks Glen!
>>
>> I will take at your project.  Unfortunately I will only have 512 MB to 1024 MB to work with as Lucene is only one component in a larger software system running on one machine.  I agree with you on the C\C++ comment.  That is what I would normally use for memory intense software.  It turns out that the larger file you want to index is the larger the heap space you will need.  What I would like to see is a way to "throttle" the indexing process to control the memory footprint.  I understand that this will take longer, but if I perform the task during off hours it shouldn't matter. At least the file will be indexed correctly.
>>
>> Thanks,
>> Paul
>>
>>
>> -----Original Message-----
>> From: java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
>> Sent: Friday, September 11, 2009 9:53 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Indexing large files? - No answers yet...
>>
>> In this project:
>>  http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
>>
>> I concatenate all the text of all of articles of a single journal into
>> a single text file.
>> This can create a text file that is 500MB in size.
>> Lucene is OK in indexing files this size (in parallel even), but I
>> have a heap size of 8GB.
>>
>> I would suggest increasing your heap to as large as your machine can
>> reasonably take.
>> The reality is that Java programs (like Lucene) take up more memory
>> than a similar C or even C++ program.
>> Java may approach C/C++ in speed, but not memory.
>>
>> We don't use Java because of its memory footprint!  ;-)
>>
>> See:
>>  Programming language shootout: speed:
>> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>>  Programming language shootout: memory:
>> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>>
>> -glen
>>
>> 2009/9/11 Dan OConnor <do...@acquiremedia.com>:
>>> Paul:
>>>
>>> My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage collection related issues resolved in version 10 - 13 (especially dealing with large heaps).
>>>
>>> Next, your IndexWriter parameters would help figure out why you are using so much RAM
>>>        getMaxFieldLength()
>>>        getMaxBufferedDocs()
>>>        getMaxMergeDocs()
>>>        getRAMBufferSizeMB()
>>>
>>> How often are you calling commit?
>>> Do you close your IndexWriter after every document?
>>> How many documents of this size are you indexing?
>>> Have you used luke to look at your index?
>>> If this is a large index, have you optimized it recently?
>>> Are there any searches going on while you are indexing?
>>>
>>>
>>> Regards,
>>> Dan
>>>
>>>
>>> -----Original Message-----
>>> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
>>> Sent: Friday, September 11, 2009 7:57 AM
>>> To: java-user@lucene.apache.org
>>> Subject: RE: Indexing large files? - No answers yet...
>>>
>>> This issue is still open.  Any suggestions/help with this would be
>>> greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>> -----Original Message-----
>>> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>>> [mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>>> ] On Behalf Of Paul_Murdoch@emainc.com
>>> Sent: Monday, August 31, 2009 10:28 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Indexing large files?
>>>
>>> Hi,
>>>
>>>
>>>
>>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
>>> consistently receiving "OutOfMemoryError: Java heap space", when trying
>>> to index large text files.
>>>
>>>
>>>
>>> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
>>> max. heap size.  So I increased the max. heap size to 512 MB.  This
>>> worked for the 5 MB text file, but Lucene still used 84 MB of heap space
>>> to do this.  Why so much?
>>>
>>>
>>>
>>> The class FreqProxTermsWriterPerField appears to be the biggest memory
>>> consumer by far according to JConsole and the TPTP Memory Profiling
>>> plugin for Eclipse Ganymede.
>>>
>>>
>>>
>>> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
>>> max. heap size.  Increasing the max. heap size to 1024 MB works but
>>> Lucene uses 826 MB of heap space while performing this.  Still seems
>>> like way too much memory is being used to do this.  I'm sure larger
>>> files would cause the error as it seems correlative.
>>>
>>>
>>>
>>> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
>>> practice for indexing large files?  Here is a code snippet that I'm
>>> using:
>>>
>>>
>>>
>>> // Index the content of a text file.
>>>
>>>      private Boolean saveTXTFile(File textFile, Document textDocument)
>>> throws CIDBException {
>>>
>>>
>>>
>>>            try {
>>>
>>>
>>>
>>>                  Boolean isFile = textFile.isFile();
>>>
>>>                  Boolean hasTextExtension =
>>> textFile.getName().endsWith(".txt");
>>>
>>>
>>>
>>>                  if (isFile && hasTextExtension) {
>>>
>>>
>>>
>>>                        System.out.println("File " +
>>> textFile.getCanonicalPath() + " is being indexed");
>>>
>>>                        Reader textFileReader = new
>>> FileReader(textFile);
>>>
>>>                        if (textDocument == null)
>>>
>>>                              textDocument = new Document();
>>>
>>>                        textDocument.add(new Field("content",
>>> textFileReader));
>>>
>>>                        indexWriter.addDocument(textDocument);
>>> // BREAKS HERE!!!!
>>>
>>>                  }
>>>
>>>            } catch (FileNotFoundException fnfe) {
>>>
>>>                  System.out.println(fnfe.getMessage());
>>>
>>>                  return false;
>>>
>>>            } catch (CorruptIndexException cie) {
>>>
>>>                  throw new CIDBException("The index has become
>>> corrupt.");
>>>
>>>            } catch (IOException ioe) {
>>>
>>>                  System.out.println(ioe.getMessage());
>>>
>>>                  return false;
>>>
>>>            }
>>>
>>>            return true;
>>>
>>>      }
>>>
>>>
>>>
>>>
>>>
>>> Thanks much,
>>>
>>>
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>>
>> --
>>
>> -
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> --
>
> -
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing large files? - No answers yet...

Posted by Michael McCandless <lu...@mikemccandless.com>.
To minimize Lucene's RAM usage during indexing, you should flush after
every document, eg by setting the ramBufferSizeMB to something tiny
(or maxBufferedDocs to 1).
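
Concretely, that might look something like the following against the 2.4 API as
I recall it (the directory path is a placeholder and the 1 MB figure is
arbitrary; the setRAMBufferSizeMB call is the only line that matters here):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class TinyBufferIndexer {
    public static void main(String[] args) throws Exception {
        // "/path/to/index" is a placeholder; point this at your index directory.
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/path/to/index"),
                new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Flush the in-memory postings buffer very aggressively, trading
        // indexing speed for a smaller peak heap.
        writer.setRAMBufferSizeMB(1.0);
        // ... addDocument() calls go here ...
        writer.close();
    }
}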

But, unfortunately, Lucene cannot flush partway through indexing one
document.  Ie, the full document must be indexed into RAM before being
flushed.  So the worst case for Lucene will always be a single large
document.

Worse, documents with an unusually high number of unique terms will
then consume even more memory, because there is a certain RAM cost for
each unique term that's seen.

So the absolute worst case is a single large document, all of whose
terms are unique, which seems to be what's being tested here.

In theory one could make a custom indexing chain that knows it will
only hold a single document in ram at once, and could therefore trim
some of the data that we now must store per term, or maybe reduce the
size of the data types (we now use int for most fields per term, but
you could reduce them to shorts and force a flush whenever the shorts
might overflow), etc.

One possible workaround would be to pre-divide such large documents,
before indexing them, though this'd require coalescing at search time.

Mike

On Fri, Sep 11, 2009 at 11:02 AM,  <Pa...@emainc.com> wrote:
> Glen,
>
> Absolutely. I think a RMFC Lucene would great, especially for reduced memory or low bandwidth client/server scenarios.
>
> I just looked at your LuSql tool and it just what I needed about 9 months ago :-).  I wrote a simple re-indexer that interfaces to an SQL Server 2005 database and Lucene, but I could have saved some time if I knew about LuSql.  Unfortunately we're too far down the road in development to test and possibly integrate it into our system now, but I will put it on the R&D list for the next iteration.
>
> Thanks again,
>
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42277-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42277-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
> Sent: Friday, September 11, 2009 10:44 AM
> To: java-user@lucene.apache.org
> Subject: Re: Indexing large files? - No answers yet...
>
> Paul,
>
> I saw your last post and now understand the issues you face.
>
> I don't think there has been any effort to produce a
> reduced-memory-footprint configurable (RMFC) Lucene. With the many
> mobile devices, embedded and other reduced memory devices, should this
> perhaps be one of the areas the Lucene community looks in to?
>
> -Glen
>
> 2009/9/11  <Pa...@emainc.com>:
>> Thanks Glen!
>>
>> I will take at your project.  Unfortunately I will only have 512 MB to 1024 MB to work with as Lucene is only one component in a larger software system running on one machine.  I agree with you on the C\C++ comment.  That is what I would normally use for memory intense software.  It turns out that the larger file you want to index is the larger the heap space you will need.  What I would like to see is a way to "throttle" the indexing process to control the memory footprint.  I understand that this will take longer, but if I perform the task during off hours it shouldn't matter. At least the file will be indexed correctly.
>>
>> Thanks,
>> Paul
>>
>>
>> -----Original Message-----
>> From: java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
>> Sent: Friday, September 11, 2009 9:53 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Indexing large files? - No answers yet...
>>
>> In this project:
>>  http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
>>
>> I concatenate all the text of all of articles of a single journal into
>> a single text file.
>> This can create a text file that is 500MB in size.
>> Lucene is OK in indexing files this size (in parallel even), but I
>> have a heap size of 8GB.
>>
>> I would suggest increasing your heap to as large as your machine can
>> reasonably take.
>> The reality is that Java programs (like Lucene) take up more memory
>> than a similar C or even C++ program.
>> Java may approach C/C++ in speed, but not memory.
>>
>> We don't use Java because of its memory footprint!  ;-)
>>
>> See:
>>  Programming language shootout: speed:
>> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>>  Programming language shootout: memory:
>> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>>
>> -glen
>>
>> 2009/9/11 Dan OConnor <do...@acquiremedia.com>:
>>> Paul:
>>>
>>> My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage collection related issues resolved in version 10 - 13 (especially dealing with large heaps).
>>>
>>> Next, your IndexWriter parameters would help figure out why you are using so much RAM
>>>        getMaxFieldLength()
>>>        getMaxBufferedDocs()
>>>        getMaxMergeDocs()
>>>        getRAMBufferSizeMB()
>>>
>>> How often are you calling commit?
>>> Do you close your IndexWriter after every document?
>>> How many documents of this size are you indexing?
>>> Have you used luke to look at your index?
>>> If this is a large index, have you optimized it recently?
>>> Are there any searches going on while you are indexing?
>>>
>>>
>>> Regards,
>>> Dan
>>>
>>>
>>> -----Original Message-----
>>> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
>>> Sent: Friday, September 11, 2009 7:57 AM
>>> To: java-user@lucene.apache.org
>>> Subject: RE: Indexing large files? - No answers yet...
>>>
>>> This issue is still open.  Any suggestions/help with this would be
>>> greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Paul
>>>
>>>
>>> -----Original Message-----
>>> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>>> [mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>>> ] On Behalf Of Paul_Murdoch@emainc.com
>>> Sent: Monday, August 31, 2009 10:28 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Indexing large files?
>>>
>>> Hi,
>>>
>>>
>>>
>>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
>>> consistently receiving "OutOfMemoryError: Java heap space", when trying
>>> to index large text files.
>>>
>>>
>>>
>>> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
>>> max. heap size.  So I increased the max. heap size to 512 MB.  This
>>> worked for the 5 MB text file, but Lucene still used 84 MB of heap space
>>> to do this.  Why so much?
>>>
>>>
>>>
>>> The class FreqProxTermsWriterPerField appears to be the biggest memory
>>> consumer by far according to JConsole and the TPTP Memory Profiling
>>> plugin for Eclipse Ganymede.
>>>
>>>
>>>
>>> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
>>> max. heap size.  Increasing the max. heap size to 1024 MB works but
>>> Lucene uses 826 MB of heap space while performing this.  Still seems
>>> like way too much memory is being used to do this.  I'm sure larger
>>> files would cause the error as it seems correlative.
>>>
>>>
>>>
>>> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
>>> practice for indexing large files?  Here is a code snippet that I'm
>>> using:
>>>
>>>
>>>
>>> // Index the content of a text file.
>>>
>>>      private Boolean saveTXTFile(File textFile, Document textDocument)
>>> throws CIDBException {
>>>
>>>
>>>
>>>            try {
>>>
>>>
>>>
>>>                  Boolean isFile = textFile.isFile();
>>>
>>>                  Boolean hasTextExtension =
>>> textFile.getName().endsWith(".txt");
>>>
>>>
>>>
>>>                  if (isFile && hasTextExtension) {
>>>
>>>
>>>
>>>                        System.out.println("File " +
>>> textFile.getCanonicalPath() + " is being indexed");
>>>
>>>                        Reader textFileReader = new
>>> FileReader(textFile);
>>>
>>>                        if (textDocument == null)
>>>
>>>                              textDocument = new Document();
>>>
>>>                        textDocument.add(new Field("content",
>>> textFileReader));
>>>
>>>                        indexWriter.addDocument(textDocument);
>>> // BREAKS HERE!!!!
>>>
>>>                  }
>>>
>>>            } catch (FileNotFoundException fnfe) {
>>>
>>>                  System.out.println(fnfe.getMessage());
>>>
>>>                  return false;
>>>
>>>            } catch (CorruptIndexException cie) {
>>>
>>>                  throw new CIDBException("The index has become
>>> corrupt.");
>>>
>>>            } catch (IOException ioe) {
>>>
>>>                  System.out.println(ioe.getMessage());
>>>
>>>                  return false;
>>>
>>>            }
>>>
>>>            return true;
>>>
>>>      }
>>>
>>>
>>>
>>>
>>>
>>> Thanks much,
>>>
>>>
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>>
>> --
>>
>> -
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> --
>
> -
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Indexing large files? - No answers yet...

Posted by Pa...@emainc.com.
Glen,

Absolutely. I think an RMFC Lucene would be great, especially for reduced-memory or low-bandwidth client/server scenarios.

I just looked at your LuSql tool and it's just what I needed about 9 months ago :-).  I wrote a simple re-indexer that interfaces to an SQL Server 2005 database and Lucene, but I could have saved some time if I had known about LuSql.  Unfortunately we're too far down the road in development to test and possibly integrate it into our system now, but I will put it on the R&D list for the next iteration.

Thanks again,

Paul


-----Original Message-----
From: java-user-return-42277-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42277-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
Sent: Friday, September 11, 2009 10:44 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing large files? - No answers yet...

Paul,

I saw your last post and now understand the issues you face.

I don't think there has been any effort to produce a
reduced-memory-footprint configurable (RMFC) Lucene. With the many
mobile devices, embedded and other reduced memory devices, should this
perhaps be one of the areas the Lucene community looks in to?

-Glen

2009/9/11  <Pa...@emainc.com>:
> Thanks Glen!
>
> I will take at your project.  Unfortunately I will only have 512 MB to 1024 MB to work with as Lucene is only one component in a larger software system running on one machine.  I agree with you on the C\C++ comment.  That is what I would normally use for memory intense software.  It turns out that the larger file you want to index is the larger the heap space you will need.  What I would like to see is a way to "throttle" the indexing process to control the memory footprint.  I understand that this will take longer, but if I perform the task during off hours it shouldn't matter. At least the file will be indexed correctly.
>
> Thanks,
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
> Sent: Friday, September 11, 2009 9:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Indexing large files? - No answers yet...
>
> In this project:
>  http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
>
> I concatenate all the text of all of articles of a single journal into
> a single text file.
> This can create a text file that is 500MB in size.
> Lucene is OK in indexing files this size (in parallel even), but I
> have a heap size of 8GB.
>
> I would suggest increasing your heap to as large as your machine can
> reasonably take.
> The reality is that Java programs (like Lucene) take up more memory
> than a similar C or even C++ program.
> Java may approach C/C++ in speed, but not memory.
>
> We don't use Java because of its memory footprint!  ;-)
>
> See:
>  Programming language shootout: speed:
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>  Programming language shootout: memory:
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>
> -glen
>
> 2009/9/11 Dan OConnor <do...@acquiremedia.com>:
>> Paul:
>>
>> My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage collection related issues resolved in version 10 - 13 (especially dealing with large heaps).
>>
>> Next, your IndexWriter parameters would help figure out why you are using so much RAM
>>        getMaxFieldLength()
>>        getMaxBufferedDocs()
>>        getMaxMergeDocs()
>>        getRAMBufferSizeMB()
>>
>> How often are you calling commit?
>> Do you close your IndexWriter after every document?
>> How many documents of this size are you indexing?
>> Have you used luke to look at your index?
>> If this is a large index, have you optimized it recently?
>> Are there any searches going on while you are indexing?
>>
>>
>> Regards,
>> Dan
>>
>>
>> -----Original Message-----
>> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
>> Sent: Friday, September 11, 2009 7:57 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: Indexing large files? - No answers yet...
>>
>> This issue is still open.  Any suggestions/help with this would be
>> greatly appreciated.
>>
>> Thanks,
>>
>> Paul
>>
>>
>> -----Original Message-----
>> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>> [mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>> ] On Behalf Of Paul_Murdoch@emainc.com
>> Sent: Monday, August 31, 2009 10:28 AM
>> To: java-user@lucene.apache.org
>> Subject: Indexing large files?
>>
>> Hi,
>>
>>
>>
>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
>> consistently receiving "OutOfMemoryError: Java heap space", when trying
>> to index large text files.
>>
>>
>>
>> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
>> max. heap size.  So I increased the max. heap size to 512 MB.  This
>> worked for the 5 MB text file, but Lucene still used 84 MB of heap space
>> to do this.  Why so much?
>>
>>
>>
>> The class FreqProxTermsWriterPerField appears to be the biggest memory
>> consumer by far according to JConsole and the TPTP Memory Profiling
>> plugin for Eclipse Ganymede.
>>
>>
>>
>> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
>> max. heap size.  Increasing the max. heap size to 1024 MB works but
>> Lucene uses 826 MB of heap space while performing this.  Still seems
>> like way too much memory is being used to do this.  I'm sure larger
>> files would cause the error as it seems correlative.
>>
>>
>>
>> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
>> practice for indexing large files?  Here is a code snippet that I'm
>> using:
>>
>>
>>
>> // Index the content of a text file.
>>
>>      private Boolean saveTXTFile(File textFile, Document textDocument)
>> throws CIDBException {
>>
>>
>>
>>            try {
>>
>>
>>
>>                  Boolean isFile = textFile.isFile();
>>
>>                  Boolean hasTextExtension =
>> textFile.getName().endsWith(".txt");
>>
>>
>>
>>                  if (isFile && hasTextExtension) {
>>
>>
>>
>>                        System.out.println("File " +
>> textFile.getCanonicalPath() + " is being indexed");
>>
>>                        Reader textFileReader = new
>> FileReader(textFile);
>>
>>                        if (textDocument == null)
>>
>>                              textDocument = new Document();
>>
>>                        textDocument.add(new Field("content",
>> textFileReader));
>>
>>                        indexWriter.addDocument(textDocument);
>> // BREAKS HERE!!!!
>>
>>                  }
>>
>>            } catch (FileNotFoundException fnfe) {
>>
>>                  System.out.println(fnfe.getMessage());
>>
>>                  return false;
>>
>>            } catch (CorruptIndexException cie) {
>>
>>                  throw new CIDBException("The index has become
>> corrupt.");
>>
>>            } catch (IOException ioe) {
>>
>>                  System.out.println(ioe.getMessage());
>>
>>                  return false;
>>
>>            }
>>
>>            return true;
>>
>>      }
>>
>>
>>
>>
>>
>> Thanks much,
>>
>>
>>
>> Paul
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> --
>
> -
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 

-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing large files? - No answers yet...

Posted by Glen Newton <gl...@gmail.com>.
Paul,

I saw your last post and now understand the issues you face.

I don't think there has been any effort to produce a
reduced-memory-footprint configurable (RMFC) Lucene. With the many
mobile, embedded, and other reduced-memory devices out there, should this
perhaps be one of the areas the Lucene community looks into?

-Glen

2009/9/11  <Pa...@emainc.com>:
> Thanks Glen!
>
> I will take at your project.  Unfortunately I will only have 512 MB to 1024 MB to work with as Lucene is only one component in a larger software system running on one machine.  I agree with you on the C\C++ comment.  That is what I would normally use for memory intense software.  It turns out that the larger file you want to index is the larger the heap space you will need.  What I would like to see is a way to "throttle" the indexing process to control the memory footprint.  I understand that this will take longer, but if I perform the task during off hours it shouldn't matter. At least the file will be indexed correctly.
>
> Thanks,
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
> Sent: Friday, September 11, 2009 9:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Indexing large files? - No answers yet...
>
> In this project:
>  http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
>
> I concatenate all the text of all of articles of a single journal into
> a single text file.
> This can create a text file that is 500MB in size.
> Lucene is OK in indexing files this size (in parallel even), but I
> have a heap size of 8GB.
>
> I would suggest increasing your heap to as large as your machine can
> reasonably take.
> The reality is that Java programs (like Lucene) take up more memory
> than a similar C or even C++ program.
> Java may approach C/C++ in speed, but not memory.
>
> We don't use Java because of its memory footprint!  ;-)
>
> See:
>  Programming language shootout: speed:
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>  Programming language shootout: memory:
> http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
>
> -glen
>
> 2009/9/11 Dan OConnor <do...@acquiremedia.com>:
>> Paul:
>>
>> My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage collection related issues resolved in version 10 - 13 (especially dealing with large heaps).
>>
>> Next, your IndexWriter parameters would help figure out why you are using so much RAM
>>        getMaxFieldLength()
>>        getMaxBufferedDocs()
>>        getMaxMergeDocs()
>>        getRAMBufferSizeMB()
>>
>> How often are you calling commit?
>> Do you close your IndexWriter after every document?
>> How many documents of this size are you indexing?
>> Have you used luke to look at your index?
>> If this is a large index, have you optimized it recently?
>> Are there any searches going on while you are indexing?
>>
>>
>> Regards,
>> Dan
>>
>>
>> -----Original Message-----
>> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
>> Sent: Friday, September 11, 2009 7:57 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: Indexing large files? - No answers yet...
>>
>> This issue is still open.  Any suggestions/help with this would be
>> greatly appreciated.
>>
>> Thanks,
>>
>> Paul
>>
>>
>> -----Original Message-----
>> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>> [mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
>> ] On Behalf Of Paul_Murdoch@emainc.com
>> Sent: Monday, August 31, 2009 10:28 AM
>> To: java-user@lucene.apache.org
>> Subject: Indexing large files?
>>
>> Hi,
>>
>>
>>
>> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
>> consistently receiving "OutOfMemoryError: Java heap space", when trying
>> to index large text files.
>>
>>
>>
>> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
>> max. heap size.  So I increased the max. heap size to 512 MB.  This
>> worked for the 5 MB text file, but Lucene still used 84 MB of heap space
>> to do this.  Why so much?
>>
>>
>>
>> The class FreqProxTermsWriterPerField appears to be the biggest memory
>> consumer by far according to JConsole and the TPTP Memory Profiling
>> plugin for Eclipse Ganymede.
>>
>>
>>
>> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
>> max. heap size.  Increasing the max. heap size to 1024 MB works but
>> Lucene uses 826 MB of heap space while performing this.  Still seems
>> like way too much memory is being used to do this.  I'm sure larger
>> files would cause the error as it seems correlative.
>>
>>
>>
>> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
>> practice for indexing large files?  Here is a code snippet that I'm
>> using:
>>
>>
>>
>> // Index the content of a text file.
>>
>>      private Boolean saveTXTFile(File textFile, Document textDocument)
>> throws CIDBException {
>>
>>
>>
>>            try {
>>
>>
>>
>>                  Boolean isFile = textFile.isFile();
>>
>>                  Boolean hasTextExtension =
>> textFile.getName().endsWith(".txt");
>>
>>
>>
>>                  if (isFile && hasTextExtension) {
>>
>>
>>
>>                        System.out.println("File " +
>> textFile.getCanonicalPath() + " is being indexed");
>>
>>                        Reader textFileReader = new
>> FileReader(textFile);
>>
>>                        if (textDocument == null)
>>
>>                              textDocument = new Document();
>>
>>                        textDocument.add(new Field("content",
>> textFileReader));
>>
>>                        indexWriter.addDocument(textDocument);
>> // BREAKS HERE!!!!
>>
>>                  }
>>
>>            } catch (FileNotFoundException fnfe) {
>>
>>                  System.out.println(fnfe.getMessage());
>>
>>                  return false;
>>
>>            } catch (CorruptIndexException cie) {
>>
>>                  throw new CIDBException("The index has become
>> corrupt.");
>>
>>            } catch (IOException ioe) {
>>
>>                  System.out.println(ioe.getMessage());
>>
>>                  return false;
>>
>>            }
>>
>>            return true;
>>
>>      }
>>
>>
>>
>>
>>
>> Thanks much,
>>
>>
>>
>> Paul
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
>
> --
>
> -
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 

-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Indexing large files? - No answers yet...

Posted by Pa...@emainc.com.
Thanks Glen!

I will take a look at your project.  Unfortunately I will only have 512 MB to 1024 MB to work with, as Lucene is only one component in a larger software system running on one machine.  I agree with you on the C/C++ comment.  That is what I would normally use for memory-intensive software.  It turns out that the larger the file you want to index, the larger the heap space you will need.  What I would like to see is a way to "throttle" the indexing process to control the memory footprint.  I understand that this will take longer, but if I perform the task during off hours it shouldn't matter. At least the file will be indexed correctly.

Thanks,
Paul


-----Original Message-----
From: java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org [mailto:java-user-return-42272-Paul_Murdoch=emainc.com@lucene.apache.org] On Behalf Of Glen Newton
Sent: Friday, September 11, 2009 9:53 AM
To: java-user@lucene.apache.org
Subject: Re: Indexing large files? - No answers yet...

In this project:
 http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html

I concatenate all the text of all of articles of a single journal into
a single text file.
This can create a text file that is 500MB in size.
Lucene is OK in indexing files this size (in parallel even), but I
have a heap size of 8GB.

I would suggest increasing your heap to as large as your machine can
reasonably take.
The reality is that Java programs (like Lucene) take up more memory
than a similar C or even C++ program.
Java may approach C/C++ in speed, but not memory.

We don't use Java because of its memory footprint!  ;-)

See:
 Programming language shootout: speed:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
 Programming language shootout: memory:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0

-glen

2009/9/11 Dan OConnor <do...@acquiremedia.com>:
> Paul:
>
> My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage collection related issues resolved in version 10 - 13 (especially dealing with large heaps).
>
> Next, your IndexWriter parameters would help figure out why you are using so much RAM
>        getMaxFieldLength()
>        getMaxBufferedDocs()
>        getMaxMergeDocs()
>        getRAMBufferSizeMB()
>
> How often are you calling commit?
> Do you close your IndexWriter after every document?
> How many documents of this size are you indexing?
> Have you used luke to look at your index?
> If this is a large index, have you optimized it recently?
> Are there any searches going on while you are indexing?
>
>
> Regards,
> Dan
>
>
> -----Original Message-----
> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
> Sent: Friday, September 11, 2009 7:57 AM
> To: java-user@lucene.apache.org
> Subject: RE: Indexing large files? - No answers yet...
>
> This issue is still open.  Any suggestions/help with this would be
> greatly appreciated.
>
> Thanks,
>
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
> [mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
> ] On Behalf Of Paul_Murdoch@emainc.com
> Sent: Monday, August 31, 2009 10:28 AM
> To: java-user@lucene.apache.org
> Subject: Indexing large files?
>
> Hi,
>
>
>
> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
> consistently receiving "OutOfMemoryError: Java heap space", when trying
> to index large text files.
>
>
>
> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
> max. heap size.  So I increased the max. heap size to 512 MB.  This
> worked for the 5 MB text file, but Lucene still used 84 MB of heap space
> to do this.  Why so much?
>
>
>
> The class FreqProxTermsWriterPerField appears to be the biggest memory
> consumer by far according to JConsole and the TPTP Memory Profiling
> plugin for Eclipse Ganymede.
>
>
>
> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
> max. heap size.  Increasing the max. heap size to 1024 MB works but
> Lucene uses 826 MB of heap space while performing this.  Still seems
> like way too much memory is being used to do this.  I'm sure larger
> files would cause the error as it seems correlative.
>
>
>
> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
> practice for indexing large files?  Here is a code snippet that I'm
> using:
>
>
>
> // Index the content of a text file.
>
>      private Boolean saveTXTFile(File textFile, Document textDocument)
> throws CIDBException {
>
>
>
>            try {
>
>
>
>                  Boolean isFile = textFile.isFile();
>
>                  Boolean hasTextExtension =
> textFile.getName().endsWith(".txt");
>
>
>
>                  if (isFile && hasTextExtension) {
>
>
>
>                        System.out.println("File " +
> textFile.getCanonicalPath() + " is being indexed");
>
>                        Reader textFileReader = new
> FileReader(textFile);
>
>                        if (textDocument == null)
>
>                              textDocument = new Document();
>
>                        textDocument.add(new Field("content",
> textFileReader));
>
>                        indexWriter.addDocument(textDocument);
> // BREAKS HERE!!!!
>
>                  }
>
>            } catch (FileNotFoundException fnfe) {
>
>                  System.out.println(fnfe.getMessage());
>
>                  return false;
>
>            } catch (CorruptIndexException cie) {
>
>                  throw new CIDBException("The index has become
> corrupt.");
>
>            } catch (IOException ioe) {
>
>                  System.out.println(ioe.getMessage());
>
>                  return false;
>
>            }
>
>            return true;
>
>      }
>
>
>
>
>
> Thanks much,
>
>
>
> Paul
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 

-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing large files? - No answers yet...

Posted by Glen Newton <gl...@gmail.com>.
In this project:
 http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html

I concatenate all the text of all articles of a single journal into
a single text file.
This can create a text file that is 500 MB in size.
Lucene is OK indexing files this size (even in parallel), but I
have a heap size of 8 GB.

I would suggest increasing your heap to as large a size as your machine can
reasonably take.
The reality is that Java programs (like Lucene) take up more memory
than a similar C or even C++ program.
Java may approach C/C++ in speed, but not in memory.

We don't use Java because of its memory footprint!  ;-)

See:
 Programming language shootout: speed:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=1&xmem=0&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0
 Programming language shootout: memory:
http://shootout.alioth.debian.org/u32q/benchmark.php?test=all&lang=all&d=ndata&calc=calculate&xfullcpu=0&xmem=1&xloc=0&binarytrees=1&chameneosredux=1&fannkuch=1&fasta=1&knucleotide=1&mandelbrot=1&meteor=0&nbody=1&pidigits=1&regexdna=1&revcomp=1&spectralnorm=1&threadring=0

-glen
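
For reference, the maximum heap is raised with the JVM's -Xmx flag when the
indexing process is launched; the value and the "MyIndexer" main class below
are purely illustrative:

  java -Xmx1024m -cp "lucene-core-2.4.0.jar;." MyIndexer   (Windows-style classpath; adjust as needed)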

2009/9/11 Dan OConnor <do...@acquiremedia.com>:
> Paul:
>
> My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage collection related issues resolved in version 10 - 13 (especially dealing with large heaps).
>
> Next, your IndexWriter parameters would help figure out why you are using so much RAM
>        getMaxFieldLength()
>        getMaxBufferedDocs()
>        getMaxMergeDocs()
>        getRAMBufferSizeMB()
>
> How often are you calling commit?
> Do you close your IndexWriter after every document?
> How many documents of this size are you indexing?
> Have you used luke to look at your index?
> If this is a large index, have you optimized it recently?
> Are there any searches going on while you are indexing?
>
>
> Regards,
> Dan
>
>
> -----Original Message-----
> From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com]
> Sent: Friday, September 11, 2009 7:57 AM
> To: java-user@lucene.apache.org
> Subject: RE: Indexing large files? - No answers yet...
>
> This issue is still open.  Any suggestions/help with this would be
> greatly appreciated.
>
> Thanks,
>
> Paul
>
>
> -----Original Message-----
> From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
> [mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
> ] On Behalf Of Paul_Murdoch@emainc.com
> Sent: Monday, August 31, 2009 10:28 AM
> To: java-user@lucene.apache.org
> Subject: Indexing large files?
>
> Hi,
>
>
>
> I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
> consistently receiving "OutOfMemoryError: Java heap space", when trying
> to index large text files.
>
>
>
> Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
> max. heap size.  So I increased the max. heap size to 512 MB.  This
> worked for the 5 MB text file, but Lucene still used 84 MB of heap space
> to do this.  Why so much?
>
>
>
> The class FreqProxTermsWriterPerField appears to be the biggest memory
> consumer by far according to JConsole and the TPTP Memory Profiling
> plugin for Eclipse Ganymede.
>
>
>
> Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
> max. heap size.  Increasing the max. heap size to 1024 MB works but
> Lucene uses 826 MB of heap space while performing this.  Still seems
> like way too much memory is being used to do this.  I'm sure larger
> files would cause the error as it seems correlative.
>
>
>
> I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
> practice for indexing large files?  Here is a code snippet that I'm
> using:
>
>
>
> // Index the content of a text file.
>
>      private Boolean saveTXTFile(File textFile, Document textDocument)
> throws CIDBException {
>
>
>
>            try {
>
>
>
>                  Boolean isFile = textFile.isFile();
>
>                  Boolean hasTextExtension =
> textFile.getName().endsWith(".txt");
>
>
>
>                  if (isFile && hasTextExtension) {
>
>
>
>                        System.out.println("File " +
> textFile.getCanonicalPath() + " is being indexed");
>
>                        Reader textFileReader = new
> FileReader(textFile);
>
>                        if (textDocument == null)
>
>                              textDocument = new Document();
>
>                        textDocument.add(new Field("content",
> textFileReader));
>
>                        indexWriter.addDocument(textDocument);
> // BREAKS HERE!!!!
>
>                  }
>
>            } catch (FileNotFoundException fnfe) {
>
>                  System.out.println(fnfe.getMessage());
>
>                  return false;
>
>            } catch (CorruptIndexException cie) {
>
>                  throw new CIDBException("The index has become
> corrupt.");
>
>            } catch (IOException ioe) {
>
>                  System.out.println(ioe.getMessage());
>
>                  return false;
>
>            }
>
>            return true;
>
>      }
>
>
>
>
>
> Thanks much,
>
>
>
> Paul
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 

-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Indexing large files? - No answers yet...

Posted by Pa...@emainc.com.
Thanks Dan!

I upgraded my JVM from .12 to .16.  I'll test with that.

I've been testing by setting many IndexWriter parameters manually to see
where the best performance is.  The net result was just delaying the
OOM.

The scenario is a test with an empty index.  I have a 5 MB file with
800,000 unique terms in it. I make one document with the file and then
add that document to the IndexWriter for indexing.  At a heap space of
64 MB the OOM occurs almost immediately when
IndexWriter.addDocument(document) is called. If I increase the heap space
to 128 MB, indexing is successful, taking less than 5 seconds to complete. So...

I commit only once.
The OOM occurs before I can close the IndexWriter at 64 MB heap space. I
close and optimize on the successful 128 MB test.
I'm only indexing one document.
I use Luke all the time for other indexes, but after the OOM on this
test not even the "Force Open" will get me into the index.
There are no searches going on.  This is a test to try and index one 5
MB file to an empty index.

So you're probably asking why I don't just increase the heap space and
be happy.  The answer is that the larger the file the more heap space is
needed.  The system I'm developing doesn't have the heap space required
for the potentially large files that the end user might try to index.  So
I would like a way to index the file as one document using a small
memory footprint.  It would be nice to be able to "throttle" the
indexing of large files to control memory usage.

Thanks,
Paul
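
A minimal sketch of capping IndexWriter's buffered RAM through settings that
exist in the 2.4 API (the values are illustrative, and since the buffer is
only checked between documents, a single very large document can still exceed
it, so this is an illustration rather than a verified fix):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import java.io.IOException;

// Sketch: bound the RAM IndexWriter may buffer before flushing to disk.
IndexWriter createThrottledWriter(Directory dir) throws IOException {
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
            IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setRAMBufferSizeMB(16.0);                            // flush after ~16 MB of buffered postings
    writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);  // flush by RAM usage only
    writer.setMergeFactor(10);                                  // keep merges from piling up
    return writer;
}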


-----Original Message-----
From: java-user-return-42271-Paul_Murdoch=emainc.com@lucene.apache.org
[mailto:java-user-return-42271-Paul_Murdoch=emainc.com@lucene.apache.org
] On Behalf Of Dan OConnor
Sent: Friday, September 11, 2009 8:13 AM
To: java-user@lucene.apache.org
Subject: RE: Indexing large files? - No answers yet...

Paul:

My first suggestion would be to update your JVM to the latest version
(or at least .14). There were several garbage collection related issues
resolved in version 10 - 13 (especially dealing with large heaps).

Next, your IndexWriter parameters would help figure out why you are
using so much RAM
	getMaxFieldLength()
	getMaxBufferedDocs()
	getMaxMergeDocs()
	getRAMBufferSizeMB()

How often are you calling commit?
Do you close your IndexWriter after every document?
How many documents of this size are you indexing?
Have you used luke to look at your index?
If this is a large index, have you optimized it recently?
Are there any searches going on while you are indexing?


Regards,
Dan


-----Original Message-----
From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com] 
Sent: Friday, September 11, 2009 7:57 AM
To: java-user@lucene.apache.org
Subject: RE: Indexing large files? - No answers yet...

This issue is still open.  Any suggestions/help with this would be
greatly appreciated.

Thanks,

Paul


-----Original Message-----
From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
[mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
] On Behalf Of Paul_Murdoch@emainc.com
Sent: Monday, August 31, 2009 10:28 AM
To: java-user@lucene.apache.org
Subject: Indexing large files?

Hi,

 

I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
consistently receiving "OutOfMemoryError: Java heap space", when trying
to index large text files.

 

Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
max. heap size.  So I increased the max. heap size to 512 MB.  This
worked for the 5 MB text file, but Lucene still used 84 MB of heap space
to do this.  Why so much?

 

The class FreqProxTermsWriterPerField appears to be the biggest memory
consumer by far according to JConsole and the TPTP Memory Profiling
plugin for Eclipse Ganymede.

 

Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
max. heap size.  Increasing the max. heap size to 1024 MB works but
Lucene uses 826 MB of heap space while performing this.  Still seems
like way too much memory is being used to do this.  I'm sure larger
files would cause the error as it seems correlative.

 

I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
practice for indexing large files?  Here is a code snippet that I'm
using:

 

// Index the content of a text file.

      private Boolean saveTXTFile(File textFile, Document textDocument)
throws CIDBException {           

            

            try {             

                              

                  Boolean isFile = textFile.isFile();

                  Boolean hasTextExtension =
textFile.getName().endsWith(".txt");

                  

                  if (isFile && hasTextExtension) {

             

                        System.out.println("File " +
textFile.getCanonicalPath() + " is being indexed");

                        Reader textFileReader = new
FileReader(textFile);

                        if (textDocument == null)

                              textDocument = new Document();

                        textDocument.add(new Field("content",
textFileReader));

                        indexWriter.addDocument(textDocument);
// BREAKS HERE!!!!

                  }                    

            } catch (FileNotFoundException fnfe) {

                  System.out.println(fnfe.getMessage());

                  return false;

            } catch (CorruptIndexException cie) {

                  throw new CIDBException("The index has become
corrupt.");

            } catch (IOException ioe) {

                  System.out.println(ioe.getMessage());

                  return false;

            }                    

            return true;

      }

 

 

Thanks much,

 

Paul

 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Indexing large files? - No answers yet...

Posted by Dan OConnor <do...@acquiremedia.com>.
Paul:

My first suggestion would be to update your JVM to the latest version (or at least .14). There were several garbage-collection-related issues resolved in versions 10 - 13 (especially dealing with large heaps).

Next, your IndexWriter parameters would help figure out why you are using so much RAM:
	getMaxFieldLength()
	getMaxBufferedDocs()
	getMaxMergeDocs()
	getRAMBufferSizeMB()

How often are you calling commit?
Do you close your IndexWriter after every document?
How many documents of this size are you indexing?
Have you used Luke to look at your index?
If this is a large index, have you optimized it recently?
Are there any searches going on while you are indexing?


Regards,
Dan
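
For the record, the four settings listed above can be dumped with the writer's
getters (a trivial sketch; "indexWriter" stands for whatever IndexWriter
instance is doing the indexing):

System.out.println("maxFieldLength  = " + indexWriter.getMaxFieldLength());
System.out.println("maxBufferedDocs = " + indexWriter.getMaxBufferedDocs());
System.out.println("maxMergeDocs    = " + indexWriter.getMaxMergeDocs());
System.out.println("ramBufferSizeMB = " + indexWriter.getRAMBufferSizeMB());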


-----Original Message-----
From: Paul_Murdoch@emainc.com [mailto:Paul_Murdoch@emainc.com] 
Sent: Friday, September 11, 2009 7:57 AM
To: java-user@lucene.apache.org
Subject: RE: Indexing large files? - No answers yet...

This issue is still open.  Any suggestions/help with this would be
greatly appreciated.

Thanks,

Paul


-----Original Message-----
From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
[mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
] On Behalf Of Paul_Murdoch@emainc.com
Sent: Monday, August 31, 2009 10:28 AM
To: java-user@lucene.apache.org
Subject: Indexing large files?

Hi,

 

I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
consistently receiving "OutOfMemoryError: Java heap space", when trying
to index large text files.

 

Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
max. heap size.  So I increased the max. heap size to 512 MB.  This
worked for the 5 MB text file, but Lucene still used 84 MB of heap space
to do this.  Why so much?

 

The class FreqProxTermsWriterPerField appears to be the biggest memory
consumer by far according to JConsole and the TPTP Memory Profiling
plugin for Eclipse Ganymede.

 

Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
max. heap size.  Increasing the max. heap size to 1024 MB works but
Lucene uses 826 MB of heap space while performing this.  Still seems
like way too much memory is being used to do this.  I'm sure larger
files would cause the error as it seems correlative.

 

I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
practice for indexing large files?  Here is a code snippet that I'm
using:

 

// Index the content of a text file.

      private Boolean saveTXTFile(File textFile, Document textDocument)
throws CIDBException {           

            

            try {             

                              

                  Boolean isFile = textFile.isFile();

                  Boolean hasTextExtension =
textFile.getName().endsWith(".txt");

                  

                  if (isFile && hasTextExtension) {

             

                        System.out.println("File " +
textFile.getCanonicalPath() + " is being indexed");

                        Reader textFileReader = new
FileReader(textFile);

                        if (textDocument == null)

                              textDocument = new Document();

                        textDocument.add(new Field("content",
textFileReader));

                        indexWriter.addDocument(textDocument);
// BREAKS HERE!!!!

                  }                    

            } catch (FileNotFoundException fnfe) {

                  System.out.println(fnfe.getMessage());

                  return false;

            } catch (CorruptIndexException cie) {

                  throw new CIDBException("The index has become
corrupt.");

            } catch (IOException ioe) {

                  System.out.println(ioe.getMessage());

                  return false;

            }                    

            return true;

      }

 

 

Thanks much,

 

Paul

 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Indexing large files? - No answers yet...

Posted by Pa...@emainc.com.
This issue is still open.  Any suggestions/help with this would be
greatly appreciated.

Thanks,

Paul


-----Original Message-----
From: java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
[mailto:java-user-return-42080-Paul_Murdoch=emainc.com@lucene.apache.org
] On Behalf Of Paul_Murdoch@emainc.com
Sent: Monday, August 31, 2009 10:28 AM
To: java-user@lucene.apache.org
Subject: Indexing large files?

Hi,

 

I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07).  I'm
consistently receiving "OutOfMemoryError: Java heap space", when trying
to index large text files.

 

Example 1: Indexing a 5 MB text file runs out of memory with a 16 MB
max. heap size.  So I increased the max. heap size to 512 MB.  This
worked for the 5 MB text file, but Lucene still used 84 MB of heap space
to do this.  Why so much?

 

The class FreqProxTermsWriterPerField appears to be the biggest memory
consumer by far according to JConsole and the TPTP Memory Profiling
plugin for Eclipse Ganymede.

 

Example 2: Indexing a 62 MB text file runs out of memory with a 512 MB
max. heap size.  Increasing the max. heap size to 1024 MB works but
Lucene uses 826 MB of heap space while performing this.  Still seems
like way too much memory is being used to do this.  I'm sure larger
files would cause the error as it seems correlative.

 

I'm on a Windows XP SP2 platform with 2 GB of RAM.  So what is the best
practice for indexing large files?  Here is a code snippet that I'm
using:

 

// Index the content of a text file.

      private Boolean saveTXTFile(File textFile, Document textDocument)
throws CIDBException {           

            

            try {             

                              

                  Boolean isFile = textFile.isFile();

                  Boolean hasTextExtension =
textFile.getName().endsWith(".txt");

                  

                  if (isFile && hasTextExtension) {

             

                        System.out.println("File " +
textFile.getCanonicalPath() + " is being indexed");

                        Reader textFileReader = new
FileReader(textFile);

                        if (textDocument == null)

                              textDocument = new Document();

                        textDocument.add(new Field("content",
textFileReader));

                        indexWriter.addDocument(textDocument);
// BREAKS HERE!!!!

                  }                    

            } catch (FileNotFoundException fnfe) {

                  System.out.println(fnfe.getMessage());

                  return false;

            } catch (CorruptIndexException cie) {

                  throw new CIDBException("The index has become
corrupt.");

            } catch (IOException ioe) {

                  System.out.println(ioe.getMessage());

                  return false;

            }                    

            return true;

      }

 

 

Thanks much,

 

Paul

 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Why perform optimization in 'off hours'?

Posted by Ted Stockwell <em...@yahoo.com>.
Thanks for the reply.
I suspected that was the case; I was just wondering if there was something more to it.



----- Original Message ----
> From: Shai Erera <se...@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Monday, August 31, 2009 10:28:41 AM
> Subject: Re: Why perform optimization in 'off hours'?
> 
> When you run optimize(), you consume CPU and do lots of IO operations which
> can really mess up the OS IO cache. Optimize is a very heavy process and
> therefore is recommended to run at off hours. Sometimes, when your index is
> large enough, it's recommended to run it during weekends, since the
> optimize() process itself may take several hours, so that a nightly job
> won't be enough.
> 
> Shai
> 
> On Mon, Aug 31, 2009 at 6:25 PM, Ted Stockwell wrote:
> 
> > Hi All,
> >
> > I am new to Lucene and I was reading 'Lucene in Action' this weekend.
> > The book recommends that optimization be performed when the index is not in
> > use.
> > The book makes it clear that optimization *may* be performed while indexing
> > but it says that optimizing while indexing makes indexing slower.
> > However, the book does not explain *why* indexing would be slower while
> > optimizing.
> > Since I know that optimization will create new segments and not mess with
> > the old ones, I'm confused as to how optimizing may cause indexing to slow
> > down.
> >
> > Any ideas?
> >
> >
> > Thanks,
> > ted stockwell
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >



      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Why perform optimization in 'off hours'?

Posted by Shai Erera <se...@gmail.com>.
When you run optimize(), you consume CPU and do lots of IO operations, which
can really mess up the OS IO cache. Optimize is a very heavy process and
is therefore recommended to run during off hours. Sometimes, when your index is
large enough, it's recommended to run it over a weekend, since the
optimize() process itself may take several hours, so a nightly job
won't be enough.

Shai
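
A minimal sketch of the off-hours pattern (the scheduling code and the
"indexWriter" and "hoursUntilOffPeak" names are assumptions; the only Lucene
call is IndexWriter.optimize()):

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Run optimize() once a day at a quiet hour instead of after every batch.
final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(new Runnable() {
    public void run() {
        try {
            indexWriter.optimize();        // heavy CPU/IO, so keep it in off hours
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}, hoursUntilOffPeak, 24, TimeUnit.HOURS); // first run at the next off-peak hour, then daily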

On Mon, Aug 31, 2009 at 6:25 PM, Ted Stockwell <em...@yahoo.com> wrote:

> Hi All,
>
> I am new to Lucene and I was reading 'Lucene in Action' this weekend.
> The book recommends that optimization be performed when the index is not in
> use.
> The book makes it clear that optimization *may* be performed while indexing
> but it says that optimizing while indexing makes indexing slower.
> However, the book does not explain *why* indexing would be slower while
> optimizing.
> Since I know that optimization will create new segments and not mess with
> the old ones, I'm confused as to how optimizing may cause indexing to slow
> down.
>
> Any ideas?
>
>
> Thanks,
> ted stockwell
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Why perform optimization in 'off hours'?

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Why perform optimization in 'off hours'?
: In-Reply-To:
:     <5B...@sc1exc2.corp.emainc.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Why perform optimization in 'off hours'?

Posted by Ted Stockwell <em...@yahoo.com>.
Hi All,

I am new to Lucene and I was reading 'Lucene in Action' this weekend.
The book recommends that optimization be performed when the index is not in use.
The book makes it clear that optimization *may* be performed while indexing but it says that optimizing while indexing makes indexing slower.
However, the book does not explain *why* indexing would be slower while optimizing.
Since I know that optimization will create new segments and not mess with the old ones, I'm confused as to how optimizing may cause indexing to slow down.

Any ideas?


Thanks,
ted stockwell



      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: TokenStream API, Quick Question.

Posted by Uwe Schindler <uw...@thetaphi.de>.
The indexer only calls getAttribute/addAttribute one time, after initializing
(see docs). It will never call them later. If you cache tokens, you always
have to restore the state into the TokenStream's attributes.
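
A consumer-side sketch of what that means ("analyzer" and "reader" are assumed
to exist; 2.9-era TermAttribute API):

TokenStream stream = analyzer.tokenStream("content", reader);
// The attribute instance is obtained once; incrementToken() updates it in place.
TermAttribute term = (TermAttribute) stream.addAttribute(TermAttribute.class);
while (stream.incrementToken()) {
    System.out.println(term.term());   // same object, new value on every call
}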

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Daniel Shane [mailto:shaned@LEXUM.UMontreal.CA]
> Sent: Thursday, September 03, 2009 8:55 PM
> To: java-user@lucene.apache.org
> Subject: TokenStream API, Quick Question.
> 
> Does a TokenStream have to return always the same number of attributes
> with the same underlying classes for all the tokens it generates?
> 
> I mean, during the tokenization phase, can the first "token" have a Term
> and Offset Attribute and the second "token" only a Type Attribute or
> does this mean that the first token has to have an empty Type attribute
> as well?
> 
> I'm just not sure,
> Daniel Shane
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


TokenStream API, Quick Question.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Does a TokenStream always have to return the same number of attributes 
with the same underlying classes for all the tokens it generates?

I mean, during the tokenization phase, can the first "token" have a Term 
and Offset Attribute and the second "token" only a Type Attribute, or 
does this mean that the first token has to have an empty Type attribute 
as well?

I'm just not sure,
Daniel Shane

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
OK, I got it: from checking other filters, I should call 
input.incrementToken() instead of super.incrementToken().

Do you feel this kind of breaks the object model (super.incrementToken() 
should also work)?

Maybe when the old API is gone, we can stop checking if someone has 
overloaded next() or incrementToken()?

Daniel S.
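
For reference, the delegation pattern described above looks roughly like this
(a bare skeleton, not code from the thread):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// A new-API filter pulls tokens through the protected 'input' field; calling
// super.incrementToken() would bounce back through the old-API compatibility
// layer and recurse.
public class PassThroughFilter extends TokenFilter {
    public PassThroughFilter(TokenStream in) {
        super(in);
    }

    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {   // next token from the wrapped stream
            return false;
        }
        // inspect or modify attributes here
        return true;
    }
}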

> Humm... I looked at captureState() and restoreState() and it doesnt 
> seem like it would work in my scenario.
>
> I'd like the LookAheadFilter to be able to peek() several tokens 
> forward and they can have different attributes, so I don't think I 
> should assume I can restoreState() safely.
>
> Here is an application for the filter, lets say I want to recognize 
> abbreviations (like S.C.R.) at the token level. I'd need to be able to 
> peek() a few tokens forward to make sure S.C.R. is an abbreviation and 
> not simply the end of a sentence.
>
> So the user should be able to peek() a number of token forward before 
> returning to usual behavior.
>
> Here is the implementation I had in mind (untested yet because of a 
> StackOverflow) :
>
> public class LookaheadTokenFilter extends TokenFilter {
>    /** List of tokens that were peeked but not returned with next. */
>    LinkedList<AttributeSource> peekedTokens = new 
> LinkedList<AttributeSource>();
>
>    /** The position of the next character that peek() will return in 
> peekedTokens */
>    int peekPosition = 0;
>
>    public LookaheadTokenFilter(TokenStream input) {
>        super(input);
>    }
>
>    public boolean peekIncrementToken() throws IOException {
>        if (this.peekPosition >= this.peekedTokens.size()) {
>            if (this.input.incrementToken() == false) {
>                return false;
>            }
>                      
> this.peekedTokens.add(cloneAttributes());                      
> this.peekPosition = this.peekedTokens.size();
>            return true;
>        }
>               this.peekPosition++;              return true;
>    }
>      @Override
>    public boolean incrementToken() throws IOException {
>        reset();
>              if (this.peekedTokens.isEmpty() == false) {
>            this.peekedTokens.removeFirst();
>        }
>              if (this.peekedTokens.isEmpty() == false) {
>            return true;
>        }
>              return super.incrementToken();
>    }
>          @Override
>    public void reset() {
>        this.peekPosition = 0;
>    }    
>    //Overloaded methods...
>      public Attribute getAttribute(Class attClass) {
>        if (this.peekedTokens.size() > 0) {
>            return 
> this.peekedTokens.get(this.peekPosition).getAttribute(attClass);
>        }              return super.getAttribute(attClass);
>    }
>      //Overload all these just like getAttribute() ...
>    public Iterator<?> getAttributeClassesIterator() ...
>    public AttributeFactory getAttributeFactory() ...
>    public Iterator getAttributeImplsIterator() ...
>    public Attribute addAttribute(Class attClass) ...
>    public void addAttributeImpl(AttributeImpl att) ...
>    public State captureState() ...
>    public void clearAttributes() ...
>    public AttributeSource cloneAttributes() ...
>    public boolean hasAttribute(Class attClass) ...
>    public boolean hasAttributes() ...
>    public void restoreState(State state) ...                     }
>
>
> Now the problem I have is that the below code triggers an evil 
> StackOverflow because I'm overriding incrementToken() and calling 
> super.incrementToken() which will loop back because of this :
>
> public boolean incrementToken() throws IOException {
>    assert tokenWrapper != null;
>      final Token token;
>    if (supportedMethods.hasReusableNext) {
>      token = next(tokenWrapper.delegate);
>    } else {
>      assert supportedMethods.hasNext;
>      token = next(); <----- Lucene calls next();
>    }
>    if (token == null) return false;
>    tokenWrapper.delegate = token;
>    return true;
>  }
>
> which then calls :
>
> public Token next() throws IOException {
>    if (tokenWrapper == null)
>      throw new UnsupportedOperationException("This TokenStream only 
> supports the new Attributes API.");
>      if (supportedMethods.hasIncrementToken) {
>      return incrementToken() ? ((Token) tokenWrapper.delegate.clone()) 
> : null; <--- incrementToken() gets called
>    } else {
>      assert supportedMethods.hasReusableNext;
>      final Token token = next(tokenWrapper.delegate);
>      if (token == null) return null;
>      tokenWrapper.delegate = token;
>      return (Token) token.clone();
>    }
>  }
>
> and hasIncrementToken is true because I overloaded incrementToken();
>
> MethodSupport(Class clazz) {
>    hasIncrementToken = isMethodOverridden(clazz, "incrementToken", 
> METHOD_NO_PARAMS);
>    hasReusableNext = isMethodOverridden(clazz, "next", 
> METHOD_TOKEN_PARAM);
>    hasNext = isMethodOverridden(clazz, "next", METHOD_NO_PARAMS);
> }
>
> Seems like a "catch-22". From what I understand, if I override 
> incrementToken() I should not call super.incrementToken()????
>
> Daniel S.
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Jamie <ja...@stimulussoft.com>.
Hi There

In the absence of documentation, I am trying to convert an EmailFilter 
class to Lucene 3.0. It's not working! Obviously, my understanding of the 
new token filter mechanism is misguided.
Can someone in the know help me out for a sec and let me know where I am 
going wrong? Thanks.

import org.apache.commons.logging.*;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Stack;

/* Many thanks to Michael J. Prichard <mi...@mac.com> for his
  * original email filter code. It has been rewritten. */

public class EmailFilter extends TokenFilter  implements Serializable {

     public EmailFilter(TokenStream in) {
         super(in);
     }

     public final boolean incrementToken() throws java.io.IOException {

         if (!input.incrementToken()) {
             return false;
         }


         TermAttribute termAtt = (TermAttribute) 
input.getAttribute(TermAttribute.class);

         char[] buffer = termAtt.termBuffer();
         final int bufferLength = termAtt.termLength();
         String emailAddress = new String(buffer, 0,bufferLength);
         emailAddress = emailAddress.replaceAll("<", "");
         emailAddress = emailAddress.replaceAll(">", "");
         emailAddress = emailAddress.replaceAll("\"", "");

         String [] parts = extractEmailParts(emailAddress);
         clearAttributes();
         for (int i = 0; i < parts.length; i++) {
             if (parts[i]!=null) {
                 TermAttribute newTermAttribute = 
addAttribute(TermAttribute.class);
                 newTermAttribute.setTermBuffer(parts[i]);
                 newTermAttribute.setTermLength(parts[i].length());
             }
         }
         return true;
     }

     private String[] extractWhitespaceParts(String email) {
         String[] whitespaceParts = email.split(" ");
         ArrayList<String> partsList = new ArrayList<String>();
         for (int i=0; i < whitespaceParts.length; i++) {
             partsList.add(whitespaceParts[i]);
         }
         return whitespaceParts;
     }

     private String[] extractEmailParts(String email) {

         if (email.indexOf('@')==-1)
             return extractWhitespaceParts(email);

         ArrayList<String> partsList = new ArrayList<String>();

         String[] whitespaceParts = extractWhitespaceParts(email);

          for (int w=0;w<whitespaceParts.length;w++) {

              if (whitespaceParts[w].indexOf('@')==-1)
                  partsList.add(whitespaceParts[w]);
              else {
                  partsList.add(whitespaceParts[w]);
                  String[] splitOnAmpersand = whitespaceParts[w].split("@");
                  try {
                      partsList.add(splitOnAmpersand[0]);
                      partsList.add(splitOnAmpersand[1]);
                  } catch (ArrayIndexOutOfBoundsException ae) {}

                 if (splitOnAmpersand.length > 0) {
                     String[] splitOnDot = splitOnAmpersand[0].split("\\.");
                      for (int i=0; i < splitOnDot.length; i++) {
                          partsList.add(splitOnDot[i]);
                      }
                 }
                 if (splitOnAmpersand.length > 1) {
                     String[] splitOnDot = splitOnAmpersand[1].split("\\.");
                     for (int i=0; i < splitOnDot.length; i++) {
                         partsList.add(splitOnDot[i]);
                     }

                     if (splitOnDot.length > 2) {
                         String domain = splitOnDot[splitOnDot.length-2] 
+ "." + splitOnDot[splitOnDot.length-1];
                         partsList.add(domain);
                     }
                 }
              }
          }
         return partsList.toArray(new String[0]);
     }

}
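
One common way to handle the one-input-token-to-many-output-terms problem is
to queue the extra parts and emit one per incrementToken() call, rather than
adding several TermAttributes to a single token. The sketch below is only an
illustration under that assumption (2.9/3.0 TermAttribute API; offsets and
position increments are ignored), not a reviewed fix for the class above:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class PartEmittingFilter extends TokenFilter {

    private final LinkedList<String> pending = new LinkedList<String>();
    private final TermAttribute termAtt;

    public PartEmittingFilter(TokenStream in) {
        super(in);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    public final boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            termAtt.setTermBuffer(pending.removeFirst());  // hand out a queued part
            return true;
        }
        if (!input.incrementToken()) {
            return false;                                  // wrapped stream is exhausted
        }
        String email = termAtt.term();
        for (String part : email.split("[@.]")) {          // naive split, for illustration only
            if (part.length() > 0 && !part.equals(email)) {
                pending.add(part);
            }
        }
        return true;                                       // current call returns the full address
    }
}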


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Uwe Schindler wrote:
> There may be a problem that you may not want to restore the peek token into
> the TokenFilter's attributes itsself. It looks like you want to have a Token
> instance returned from peek, but the current Stream should not reset to this
> Token (you only want to "look" into the next Token and then possibly do
> something special with the current Token). To achive this, there is a method
> cloneAttributes() in TokenStream, that creates a new AttributeSource with
> same attribute types, which is independent from the cloned one. You can then
> use clone.getAttribute(TermAttribute.class).term() or similar to look into
> the next token. But creating this new clone is costy, so you may also create
> it once and reuse. In the peek method, you simply copy the state of this to
> the cloned attributesource.
>
> It's a bit complicated but should work somehow. Tell me if you need more
> help. Maybe you should provide us with some code, what you want to do with
> the TokenFilter.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>   
Humm... I looked at captureState() and restoreState() and it doesn't seem 
like it would work in my scenario.

I'd like the LookAheadFilter to be able to peek() several tokens forward 
and they can have different attributes, so I don't think I should assume 
I can restoreState() safely.

Here is an application for the filter: let's say I want to recognize 
abbreviations (like S.C.R.) at the token level. I'd need to be able to 
peek() a few tokens forward to make sure S.C.R. is an abbreviation and 
not simply the end of a sentence.

So the user should be able to peek() a number of tokens forward before 
returning to usual behavior.

Here is the implementation I had in mind (still untested because of a 
StackOverflowError):

public class LookaheadTokenFilter extends TokenFilter {
    /** List of tokens that were peeked but not returned with next. */
    LinkedList<AttributeSource> peekedTokens = new 
LinkedList<AttributeSource>();

    /** The position of the next character that peek() will return in 
peekedTokens */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }
 
    public boolean peekIncrementToken() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            if (this.input.incrementToken() == false) {
                return false;
            }
           
            this.peekedTokens.add(cloneAttributes());           
            this.peekPosition = this.peekedTokens.size();
            return true;
        }
        
        this.peekPosition++;       
        return true;
    }
   
    @Override
    public boolean incrementToken() throws IOException {
        reset();
       
        if (this.peekedTokens.isEmpty() == false) {
            this.peekedTokens.removeFirst();
        }
       
        if (this.peekedTokens.isEmpty() == false) {
            return true;
        }
       
        return super.incrementToken();
    }
       
    @Override
    public void reset() {
        this.peekPosition = 0;
    }   
   

    //Overloaded methods...
   
    public Attribute getAttribute(Class attClass) {
        if (this.peekedTokens.size() > 0) {
            return 
this.peekedTokens.get(this.peekPosition).getAttribute(attClass);
        }       
        return super.getAttribute(attClass);
    }
   
    //Overload all these just like getAttribute() ...
    public Iterator<?> getAttributeClassesIterator() ...
    public AttributeFactory getAttributeFactory() ...
    public Iterator getAttributeImplsIterator() ...
    public Attribute addAttribute(Class attClass) ...
    public void addAttributeImpl(AttributeImpl att) ...
    public State captureState() ...
    public void clearAttributes() ...
    public AttributeSource cloneAttributes() ...
    public boolean hasAttribute(Class attClass) ...
    public boolean hasAttributes() ...
    public void restoreState(State state) ...                     
}


Now the problem I have is that the code below triggers an evil 
StackOverflowError because I'm overriding incrementToken() and calling 
super.incrementToken(), which loops back because of this:

public boolean incrementToken() throws IOException {
    assert tokenWrapper != null;
   
    final Token token;
    if (supportedMethods.hasReusableNext) {
      token = next(tokenWrapper.delegate);
    } else {
      assert supportedMethods.hasNext;
      token = next(); <----- Lucene calls next();
    }
    if (token == null) return false;
    tokenWrapper.delegate = token;
    return true;
  }

which then calls :

public Token next() throws IOException {
    if (tokenWrapper == null)
      throw new UnsupportedOperationException("This TokenStream only 
supports the new Attributes API.");
   
    if (supportedMethods.hasIncrementToken) {
      return incrementToken() ? ((Token) tokenWrapper.delegate.clone()) 
: null; <--- incrementToken() gets called
    } else {
      assert supportedMethods.hasReusableNext;
      final Token token = next(tokenWrapper.delegate);
      if (token == null) return null;
      tokenWrapper.delegate = token;
      return (Token) token.clone();
    }
  }

and hasIncrementToken is true because I overloaded incrementToken();

 MethodSupport(Class clazz) {
    hasIncrementToken = isMethodOverridden(clazz, "incrementToken", 
METHOD_NO_PARAMS);
    hasReusableNext = isMethodOverridden(clazz, "next", METHOD_TOKEN_PARAM);
    hasNext = isMethodOverridden(clazz, "next", METHOD_NO_PARAMS);
}

Seems like a "catch-22". From what I understand, if I override 
incrementToken() I should not call super.incrementToken()????

Daniel S.

RE: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Uwe Schindler <uw...@thetaphi.de>.
There may be a problem in that you may not want to restore the peeked token into
the TokenFilter's attributes itself. It looks like you want to have a Token
instance returned from peek, but the current stream should not reset to this
Token (you only want to "look" into the next Token and then possibly do
something special with the current Token). To achieve this, there is a method
cloneAttributes() in TokenStream that creates a new AttributeSource with the
same attribute types, which is independent from the cloned one. You can then
use clone.getAttribute(TermAttribute.class).term() or similar to look into
the next token. But creating this new clone is costly, so you may also create
it once and reuse it. In the peek method, you simply copy the state of this to
the cloned AttributeSource.

It's a bit complicated but should work somehow. Tell me if you need more
help. Maybe you should provide us with some code showing what you want to do
with the TokenFilter.
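
A fragment illustrating that clone-and-copy step (names are illustrative; the
buffering of peeked tokens shown in Michael Busch's sketch below is still
needed):

// Created once, e.g. in the filter's constructor:
AttributeSource peeked = cloneAttributes();    // same attribute types, independent values
TermAttribute peekedTerm = (TermAttribute) peeked.addAttribute(TermAttribute.class);

// Inside peek(), after input.incrementToken() has advanced the shared attributes:
peeked.restoreState(captureState());           // copy the current values into the clone
String upcoming = peekedTerm.term();           // inspect the next token without touching 'this'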

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Michael Busch [mailto:buschmic@gmail.com]
> Sent: Wednesday, September 02, 2009 1:53 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken
> / AttributeSource), cannot implement a LookaheadTokenFilter.
> 
> This is what I had in mind (completely untested!):
> 
> public class LookaheadTokenFilter extends TokenFilter {
>    /** List of tokens that were peeked but not returned with next. */
>    LinkedList<AttributeSource.State> peekedTokens = new
> LinkedList<AttributeSource.State>();
> 
>    /** The position of the next character that peek() will return in
> peekedTokens */
>    int peekPosition = 0;
> 
>    public LookaheadTokenFilter(TokenStream input) {
>        super(input);
>    }
>      public boolean peek() throws IOException {
>        if (this.peekPosition >= this.peekedTokens.size()) {
>            boolean hasNext = input.incrementToken();
>            if (hasNext) {
>                this.peekedTokens.add(captureState());
>                this.peekPosition = this.peekedTokens.size();
>            }
>            return hasNext;
>        }
> 
>        restoreState(this.peekedTokens.get(this.peekPosition++));
>        return true;
>    }
> 
>    public void reset() { this.peekPosition = 0; }
> 
>    public boolean incrementToken() throws IOException {
>      reset();
> 
>      if (this.peekedTokens.size() > 0) {
>        restoreState(this.peekedTokens.removeFirst());
>        return true;
>      }
>      return this.input.incrementToken();
>    }
> }
> 
> 
> On 9/1/09 4:44 PM, Michael Busch wrote:
> > Daniel,
> >
> > take a look at the captureState() and restoreState() APIs in
> > AttributeSource and TokenStream. captureState() returns a State object
> > containing all attributes with its' current values.
> > restoreState(State) takes a given State and copies its values back
> > into the TokenStream. You should be able to achieve the same thing by
> > storing State objects in your List, instead of Token objects. peek()
> > would change to return true/false instead of Token and the caller of
> > peek consumes the values using the new attribute API. The change on
> > your side should be pretty simple, let us know if you run into problems!
> >
> >  Michael
> >
> > On 9/1/09 3:12 PM, Daniel Shane wrote:
> >> After thinking about it, the only conclusion I got was instead of
> >> saving the token, to save an iterator of Attributes and use that
> >> instead. It may work.
> >>
> >> Daniel Shane
> >>
> >> Daniel Shane wrote:
> >>> Hi all!
> >>>
> >>> I'm trying to port my Lucene code to the new TokenStream API and I
> >>> have a filter that I cannot seem to port using the current new API.
> >>>
> >>> The filter is called LookaheadTokenFilter. It behaves exactly like a
> >>> normal token filter, except, you can call peek() and get information
> >>> on the next token in the stream.
> >>>
> >>> Since Lucene does not support stream "rewinding", we did this by
> >>> buffering tokens when peek() was called and giving those back when
> >>> next() was called and when no more "peeked" tokens exist, we then
> >>> call super.next();
> >>>
> >>> Now, I'm looking at this new API and really I'm stuck at how to port
> >>> this using incrementToken...
> >>>
> >>> Am I missing something, is there an object I can get from the
> >>> TokenStream that I can save and get all the attributes from?
> >>>
> >>> Here is the code I'm trying to port :
> >>>
> >>> public class LookaheadTokenFilter extends TokenFilter {
> >>>    /** List of tokens that were peeked but not returned with next. */
> >>>    LinkedList<Token> peekedTokens = new LinkedList<Token>();
> >>>
> >>>    /** The position of the next character that peek() will return in
> >>> peekedTokens */
> >>>    int peekPosition = 0;
> >>>
> >>>    public LookaheadTokenFilter(TokenStream input) {
> >>>        super(input);
> >>>    }
> >>>      public Token peek() throws IOException {
> >>>        if (this.peekPosition >= this.peekedTokens.size()) {
> >>>            Token token = new Token();
> >>>            token = this.input.next(token);
> >>>            if (token != null) {
> >>>                this.peekedTokens.add(token);
> >>>                this.peekPosition = this.peekedTokens.size();
> >>>            }
> >>>            return token;
> >>>        }
> >>>
> >>>        return this.peekedTokens.get(this.peekPosition++);
> >>>    }
> >>>
> >>>    public void reset() { this.peekPosition = 0; }
> >>>
> >>>    public Token next(Token token) throws IOException {
> >>>        reset();
> >>>
> >>>        if (this.peekedTokens.size() > 0) {
> >>>            return this.peekedTokens.removeFirst();
> >>>        }
> >>>        return this.input.next(token);
> >>>    }
> >>> }
> >>>
> >>> Let me know if anyone has an idea,
> >>> Daniel Shane
> >>>
> >>
> >>
> >>
> >>
> >
> 
> 





Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Michael Busch <bu...@gmail.com>.
This is what I had in mind (completely untested!):

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.util.AttributeSource;

public class LookaheadTokenFilter extends TokenFilter {
    /** States of tokens that were peeked but not yet returned by incrementToken(). */
    LinkedList<AttributeSource.State> peekedTokens =
        new LinkedList<AttributeSource.State>();

    /** The position in peekedTokens of the next state that peek() will restore. */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }

    /**
     * Advances the lookahead by one token. After a call that returns true,
     * the attributes of this stream hold the peeked token's values.
     */
    public boolean peek() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            boolean hasNext = input.incrementToken();
            if (hasNext) {
                this.peekedTokens.add(captureState());
                this.peekPosition = this.peekedTokens.size();
            }
            return hasNext;
        }

        restoreState(this.peekedTokens.get(this.peekPosition++));
        return true;
    }

    /** Rewinds only the peek position; it does not reset the underlying stream. */
    public void reset() { this.peekPosition = 0; }

    public boolean incrementToken() throws IOException {
        reset();

        // Hand back any buffered (peeked) token first, then fall through
        // to the wrapped stream.
        if (this.peekedTokens.size() > 0) {
            restoreState(this.peekedTokens.removeFirst());
            return true;
        }
        return this.input.incrementToken();
    }
}
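
And a quick usage sketch to go with it, every bit as untested
(WhitespaceAnalyzer, the field name and the input text below are just
placeholders):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class LookaheadDemo {
    public static void main(String[] args) throws IOException {
        TokenStream base = new WhitespaceAnalyzer()
            .tokenStream("content", new StringReader("one two three"));
        LookaheadTokenFilter stream = new LookaheadTokenFilter(base);

        // The filter shares its AttributeSource with the tokenizer, so this
        // TermAttribute always reflects the "current" token on the stream.
        TermAttribute term = stream.addAttribute(TermAttribute.class);

        while (stream.incrementToken()) {
            String current = term.term();                      // token just returned
            String next = stream.peek() ? term.term() : null;  // attributes now hold the peeked token
            System.out.println(current + " -> " + next);
        }
    }
}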


On 9/1/09 4:44 PM, Michael Busch wrote:
> Daniel,
>
> take a look at the captureState() and restoreState() APIs in 
> AttributeSource and TokenStream. captureState() returns a State object 
> containing all attributes with their current values.
> restoreState(State) takes a given State and copies its values back 
> into the TokenStream. You should be able to achieve the same thing by 
> storing State objects in your List, instead of Token objects. peek() 
> would change to return true/false instead of Token and the caller of 
> peek consumes the values using the new attribute API. The change on 
> your side should be pretty simple, let us know if you run into problems!
>
>  Michael
>
> On 9/1/09 3:12 PM, Daniel Shane wrote:
>> After thinking about it, the only conclusion I got was instead of 
>> saving the token, to save an iterator of Attributes and use that 
>> instead. It may work.
>>
>> Daniel Shane
>>
>> Daniel Shane wrote:
>>> Hi all!
>>>
>>> I'm trying to port my Lucene code to the new TokenStream API and I 
>>> have a filter that I cannot seem to port using the current new API.
>>>
>>> The filter is called LookaheadTokenFilter. It behaves exactly like a 
>>> normal token filter, except, you can call peek() and get information 
>>> on the next token in the stream.
>>>
>>> Since Lucene does not support stream "rewinding", we did this by 
>>> buffering tokens when peek() was called and giving those back when 
>>> next() was called and when no more "peeked" tokens exist, we then 
>>> call super.next();
>>>
>>> Now, I'm looking at this new API and really I'm stuck at how to port 
>>> this using incrementToken...
>>>
>>> Am I missing something, is there an object I can get from the 
>>> TokenStream that I can save and get all the attributes from?
>>>
>>> Here is the code I'm trying to port :
>>>
>>> public class LookaheadTokenFilter extends TokenFilter {
>>>    /** List of tokens that were peeked but not returned with next. */
>>>    LinkedList<Token> peekedTokens = new LinkedList<Token>();
>>>
>>>    /** The position of the next character that peek() will return in 
>>> peekedTokens */
>>>    int peekPosition = 0;
>>>
>>>    public LookaheadTokenFilter(TokenStream input) {
>>>        super(input);
>>>    }
>>>      public Token peek() throws IOException {
>>>        if (this.peekPosition >= this.peekedTokens.size()) {
>>>            Token token = new Token();
>>>            token = this.input.next(token);
>>>            if (token != null) {
>>>                this.peekedTokens.add(token);
>>>                this.peekPosition = this.peekedTokens.size();
>>>            }
>>>            return token;
>>>        }
>>>
>>>        return this.peekedTokens.get(this.peekPosition++);
>>>    }
>>>
>>>    public void reset() { this.peekPosition = 0; }
>>>
>>>    public Token next(Token token) throws IOException {
>>>        reset();
>>>
>>>        if (this.peekedTokens.size() > 0) {
>>>            return this.peekedTokens.removeFirst();
>>>        }
>>>        return this.input.next(token);
>>>    }
>>> }
>>>
>>> Let me know if anyone has an idea,
>>> Daniel Shane
>>>
>>
>>
>>
>>
>




Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Michael Busch <bu...@gmail.com>.
Daniel,

take a look at the captureState() and restoreState() APIs in
AttributeSource and TokenStream. captureState() returns a State object
containing all attributes with their current values. restoreState(State)
takes a given State and copies its values back into the TokenStream. You
should be able to achieve the same thing by storing State objects in
your List instead of Token objects. peek() would change to return
true/false instead of a Token, and the caller of peek() would consume
the values through the new attribute API. The change on your side should
be pretty simple; let us know if you run into problems!

  Michael

On 9/1/09 3:12 PM, Daniel Shane wrote:
> After thinking about it, the only conclusion I got was instead of 
> saving the token, to save an iterator of Attributes and use that 
> instead. It may work.
>
> Daniel Shane
>
> Daniel Shane wrote:
>> Hi all!
>>
>> I'm trying to port my Lucene code to the new TokenStream API and I 
>> have a filter that I cannot seem to port using the current new API.
>>
>> The filter is called LookaheadTokenFilter. It behaves exactly like a 
>> normal token filter, except, you can call peek() and get information 
>> on the next token in the stream.
>>
>> Since Lucene does not support stream "rewinding", we did this by 
>> buffering tokens when peek() was called and giving those back when 
>> next() was called and when no more "peeked" tokens exist, we then 
>> call super.next();
>>
>> Now, I'm looking at this new API and really I'm stuck at how to port 
>> this using incrementToken...
>>
>> Am I missing something, is there an object I can get from the 
>> TokenStream that I can save and get all the attributes from?
>>
>> Here is the code I'm trying to port :
>>
>> public class LookaheadTokenFilter extends TokenFilter {
>>    /** List of tokens that were peeked but not returned with next. */
>>    LinkedList<Token> peekedTokens = new LinkedList<Token>();
>>
>>    /** The position of the next character that peek() will return in 
>> peekedTokens */
>>    int peekPosition = 0;
>>
>>    public LookaheadTokenFilter(TokenStream input) {
>>        super(input);
>>    }
>>      public Token peek() throws IOException {
>>        if (this.peekPosition >= this.peekedTokens.size()) {
>>            Token token = new Token();
>>            token = this.input.next(token);
>>            if (token != null) {
>>                this.peekedTokens.add(token);
>>                this.peekPosition = this.peekedTokens.size();
>>            }
>>            return token;
>>        }
>>
>>        return this.peekedTokens.get(this.peekPosition++);
>>    }
>>
>>    public void reset() { this.peekPosition = 0; }
>>
>>    public Token next(Token token) throws IOException {
>>        reset();
>>
>>        if (this.peekedTokens.size() > 0) {
>>            return this.peekedTokens.removeFirst();
>>        }
>>        return this.input.next(token);
>>    }
>> }
>>
>> Let me know if anyone has an idea,
>> Daniel Shane
>>
>
>
>
>




Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
After thinking about it, the only conclusion I came to was to save an
iterator of Attributes instead of the Token itself, and use that. It
may work.

Daniel Shane

Daniel Shane wrote:
> Hi all!
>
> I'm trying to port my Lucene code to the new TokenStream API and I 
> have a filter that I cannot seem to port using the current new API.
>
> The filter is called LookaheadTokenFilter. It behaves exactly like a 
> normal token filter, except, you can call peek() and get information 
> on the next token in the stream.
>
> Since Lucene does not support stream "rewinding", we did this by 
> buffering tokens when peek() was called and giving those back when 
> next() was called and when no more "peeked" tokens exist, we then call 
> super.next();
>
> Now, I'm looking at this new API and really I'm stuck at how to port 
> this using incrementToken...
>
> Am I missing something, is there an object I can get from the 
> TokenStream that I can save and get all the attributes from?
>
> Here is the code I'm trying to port :
>
> public class LookaheadTokenFilter extends TokenFilter {
>    /** List of tokens that were peeked but not returned with next. */
>    LinkedList<Token> peekedTokens = new LinkedList<Token>();
>
>    /** The position of the next character that peek() will return in 
> peekedTokens */
>    int peekPosition = 0;
>
>    public LookaheadTokenFilter(TokenStream input) {
>        super(input);
>    }
>      public Token peek() throws IOException {
>        if (this.peekPosition >= this.peekedTokens.size()) {
>            Token token = new Token();
>            token = this.input.next(token);
>            if (token != null) {
>                this.peekedTokens.add(token);
>                this.peekPosition = this.peekedTokens.size();
>            }
>            return token;
>        }
>
>        return this.peekedTokens.get(this.peekPosition++);
>    }
>
>    public void reset() { this.peekPosition = 0; }
>
>    public Token next(Token token) throws IOException {
>        reset();
>
>        if (this.peekedTokens.size() > 0) {
>            return this.peekedTokens.removeFirst();
>        }
>        return this.input.next(token);
>    }
> }
>
> Let me know if anyone has an idea,
> Daniel Shane
>




Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
Hi all!

I'm trying to port my Lucene code to the new TokenStream API, and I have
one filter that I cannot seem to port to the new API in its current form.

The filter is called LookaheadTokenFilter. It behaves exactly like a
normal token filter, except that you can call peek() and get information
about the next token in the stream.

Since Lucene does not support stream "rewinding", we did this by
buffering tokens whenever peek() was called and handing those back on
subsequent calls to next(); once no more "peeked" tokens remain, we fall
back to super.next().

Now I'm looking at the new API, and I'm stuck on how to port this using
incrementToken().

Am I missing something? Is there an object I can get from the
TokenStream that I can save and later read all the attributes from?

Here is the code I'm trying to port :

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class LookaheadTokenFilter extends TokenFilter {
    /** Tokens that were peeked but not yet returned by next(). */
    LinkedList<Token> peekedTokens = new LinkedList<Token>();

    /** The position in peekedTokens of the next token that peek() will return. */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }

    /** Returns the next lookahead token, or null if the stream is exhausted. */
    public Token peek() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            Token token = new Token();
            token = this.input.next(token);
            if (token != null) {
                this.peekedTokens.add(token);
                this.peekPosition = this.peekedTokens.size();
            }
            return token;
        }

        return this.peekedTokens.get(this.peekPosition++);
    }

    /** Rewinds only the peek position; it does not reset the underlying stream. */
    public void reset() { this.peekPosition = 0; }

    public Token next(Token token) throws IOException {
        reset();

        // Hand back any buffered (peeked) token first, then fall through
        // to the wrapped stream.
        if (this.peekedTokens.size() > 0) {
            return this.peekedTokens.removeFirst();
        }

        return this.input.next(token);
    }
}
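
In case it helps, this is roughly how we drive the filter today (a
sketch only; the tokenizer and the input text are placeholders):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class LookaheadOldApiDemo {
    public static void main(String[] args) throws IOException {
        TokenStream base = new WhitespaceTokenizer(new StringReader("one two three"));
        LookaheadTokenFilter stream = new LookaheadTokenFilter(base);

        Token current;
        while ((current = stream.next(new Token())) != null) {
            Token next = stream.peek();  // null once the underlying stream is exhausted
            System.out.println(current + " -> " + next);
        }
    }
}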

Let me know if anyone has an idea,
Daniel Shane