Posted to java-user@lucene.apache.org by Divya Rajendranath <di...@gmail.com> on 2007/04/24 13:00:55 UTC

Adding large files to index

Hello All,

Could anyone help me find a solution to the following problem?

I am facing problems while trying to add 50 MB files to my application. The
application has on-demand indexing of documents in place. Whenever we add a
file to our application, we first put the file details/metadata into our
database, then create an encrypted copy of the file in our own filesystem,
then extract the content and add it to the Lucene index. We set the Java
heap min = 256 MB and max = 1024 MB. Extraction and adding to the index are
done in a background thread. Soon after we extract the document's content,
add it to the index, and close the index writer, we get an OutOfMemoryError
(memory used is about 1160 MB). Because the main thread is unaware of the
exception, it allows users to continue using the application.
I observed that the write.lock file remains in the /var/tmp directory for
about 10 minutes; it is not cleaned up promptly. The 50 MB document does
not come up in searches. And when I continue using my application by
deleting some other documents after adding the 50 MB file, I find that
Lucene is not updated with the recent changes. Once this error occurs,
Lucene is not modified/updated at all, whether I add a new 10 KB file or
delete a file.
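For reference, here is a minimal sketch of the indexing step, assuming a
Lucene 1.9/2.0-era API (the paths, field names, and settings are
illustrative, not our actual docman.index.Index code). It streams the
extracted text through a Reader-backed field so the whole 50 MB body never
has to sit on the heap as a single String:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LargeFileIndexer {
    public static void main(String[] args) throws Exception {
        // Open the existing index (create == false).
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
        writer.setMaxBufferedDocs(10); // flush in-RAM segments to disk frequently

        Document doc = new Document();
        doc.add(new Field("id", "20", Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Reader-backed field: tokenized and indexed but never stored, so the
        // large body is consumed as a stream rather than one in-memory String.
        doc.add(new Field("contents",
                new BufferedReader(new FileReader("/tmp/extracted.txt"))));

        writer.addDocument(doc);
        writer.close(); // triggers the flush/merge seen in the stack trace below
    }
}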

Here, is the stack trace

===

2007-04-24 13:16:47,454 INFO [docman.index.Index] Adding 20 to physical
index

2007-04-24 13:16:47,454 INFO [docman.index.Index] Index::addToIndex() took
27207 msec.

2007-04-24 13:16:47,454 INFO [docman.index.Index] going to close index
writer

2007-04-24 13:17:00,945 INFO [STDOUT] Exception in thread "Thread-12"

2007-04-24 13:17:00,945 INFO [STDOUT] java.lang.OutOfMemoryError: Java heap
space

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.store.IndexInput.readString(IndexInput.java:91)

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:130)

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:284)

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:186)

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:709)

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:686)

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:656)

2007-04-24 13:17:00,945 INFO [STDOUT] at
org.apache.lucene.index.IndexWriter.close(IndexWriter.java:402)

2007-04-24 13:17:00,945 INFO [STDOUT] at docman.index.Index.closeIndex(
Index.java:594)

2007-04-24 13:17:00,946 INFO [STDOUT] at docman.index.Index.addToIndex(
Index.java:401)

2007-04-24 13:17:00,946 INFO [STDOUT] at
docman.index.Index.index(Index.java:437)


2007-04-24 13:17:00,946 INFO [STDOUT] at
docman.index.FastIndex$IndexWorker.run(FastIndex.java:174)

===



user@sx86qa3:/u/user/test/application> ls -al index

total 450208

drwxrwsr-x 2 user group 4096 Apr 24 15:40 .

drwxrwsrwx 10 user group 4096 Apr 23 18:36 ..

-rw-rw-r-- 1 user group 80289688 Apr 23 18:45 _j.cfs

-rw-rw-r-- 1 user group 114021843 Apr 23 19:34 _x.cfs

-rw-rw-r-- 1 user group 35937776 Apr 24 15:40 _z.fdt

-rw-rw-r-- 1 user group 16 Apr 24 15:40 _z.fdx

-rw-rw-r-- 1 user group 503 Apr 24 15:40 _z.fnm

-rw-rw-r-- 1 user group 4 Apr 23 19:34 deletable

-rw-rw-r-- 1 user group 34 Apr 23 19:34 segments

user@sx86qa3:/u/user/test/application

There are about 30 files in the application, of which 6-7 are around 50 MB;
the rest are hardly a few KB.

Is the OutOfMemoryError corrupting the index? Why is the index not
responsive to delete operations after the error?

I waited 10 minutes for the lock file to be removed on its own, then tried
deleting/adding new 10 KB files (after memory usage came back within the
Java max heap size); still the index is not responsive.
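For what it's worth, a stale write.lock left behind by a crashed writer can
be detected and cleared explicitly (a sketch against the Lucene 1.9/2.0
API; this is only safe when you are certain no other writer is still
running):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class UnlockIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("index", false);
        if (IndexReader.isLocked(dir)) {
            IndexReader.unlock(dir); // forcibly removes the stale lock file
        }
        dir.close();
    }
}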

Please let me know the cause and the solution to the problem.



Divya.

Re: Adding large files to index

Posted by Daniel Noll <da...@nuix.com>.
David Xiao wrote:
> Consider reducing the size of each file. Splitting them into smaller
> pieces will definitely help the indexer work faster.
> 
> A 50 MB pure-text file is a remarkable size; very few text files reach
> 50 MB. You must have a very good reason to keep all that information
> in one big file.
> 
> What do you think?

Not everyone using Lucene is writing a CMS.  Some of us do have to deal 
with arbitrarily big data when it appears.  How do you split an 
arbitrarily large text file?  Will breaking it in two make some queries 
stop working, e.g. if the user enters +term1 +term2 and the two terms 
end up on opposite sides of the split?

That being said, I haven't had issues adding files of this size.  But 
then, our application doesn't require the ability to read at the same 
time some other thread is writing (so our memory requirements are lower 
to begin with).

Daniel

-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.



RE: Adding large files to index

Posted by David Xiao <da...@gmail.com>.
Consider reducing the size of each file. Splitting them into smaller pieces will definitely help the indexer work faster.

A 50 MB pure-text file is a remarkable size; very few text files reach 50 MB. You must have a very good reason to keep all that information in one big file.

What do you think?
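For illustration only, here is a minimal sketch of such a split: chunk a
large text file on line boundaries into roughly 5 MB pieces, each of which
could then be indexed as its own document (the file names are hypothetical,
and note the caveat in the other reply about queries that span a split):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class TextSplitter {
    private static final long CHUNK_BYTES = 5L * 1024 * 1024; // ~5 MB per piece

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("big.txt"));
        int part = 0;
        long written = 0;
        PrintWriter out = new PrintWriter(new FileWriter("big.part" + part));
        String line;
        while ((line = in.readLine()) != null) {
            if (written >= CHUNK_BYTES) {
                out.close(); // current chunk is full; start the next one
                out = new PrintWriter(new FileWriter("big.part" + (++part)));
                written = 0;
            }
            out.println(line);
            written += line.length() + 1; // approximate byte count per line
        }
        out.close();
        in.close();
    }
}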





RE: Adding large files to index

Posted by "Rajendranath, Divya" <Di...@deshaw.com>.
But I am facing the problem even with -Xms256m and -Xmx1024m.

Yes, the file was not added to the index, because the Java process was
already using 1156 MB of memory, which is much higher than the max heap
size. But even after waiting a few minutes until memory usage came back
below the max heap value, the index was not responsive (no documents were
being added to or deleted from the index).
What could be the reason for this?


Divya
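One thing worth checking here (a sketch, assuming Java 5+; the job and
handler names are made up, not the actual FastIndex$IndexWorker code): give
the background indexing thread an uncaught-exception handler, so an
OutOfMemoryError is surfaced to the rest of the application instead of
killing the thread silently:

public class IndexWorkerLauncher {
    public static void main(String[] args) {
        Runnable indexJob = new Runnable() { // stand-in for the real index worker
            public void run() { /* extract content, addDocument, close writer */ }
        };
        Thread indexThread = new Thread(indexJob, "index-worker");
        indexThread.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                // Log the failure and flag the index as needing recovery, so the
                // main thread stops accepting changes that will never be indexed.
                System.err.println("Indexing thread " + t.getName() + " died: " + e);
            }
        });
        indexThread.start();
    }
}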



RE: Adding large files to index

Posted by David Xiao <da...@gmail.com>.
Start your program with java -Xms50m; that gives a 50 MB initial heap size.
The OutOfMemoryError occurs because the default heap size is not enough for
your application.
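As a quick sanity check, a small snippet like this can confirm which heap
limits the JVM actually picked up at runtime (illustrative only):

public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        System.out.println("max heap   = " + rt.maxMemory() / mb + " MB"); // -Xmx
        System.out.println("total heap = " + rt.totalMemory() / mb + " MB");
        System.out.println("free heap  = " + rt.freeMemory() / mb + " MB");
    }
}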




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org