Posted to solr-user@lucene.apache.org by vybe3142 <vy...@gmail.com> on 2012/04/03 20:32:20 UTC

Incrementally updating a VERY LARGE field - Is this possible?

Some days ago, I posted about an issue with SOLR running out of memory when
attempting to index large text files (say 300 MB). Details at
http://lucene.472066.n3.nabble.com/Solr-Tika-crashing-when-attempting-to-index-large-files-td3846939.html

Two things I need to point out: 

1. I don't need Tika for content extraction as the files are already in
plain text format.
2. The heap space error was caused by a futile Tika/SOLR attempt at creating
the corresponding huge XML document in memory.

I've decided to develop a custom handler that:
1. reads the file text directly
2. attempts to create a SOLR document and directly add the text data to the
corresponding field. 

One approach I've taken is to read manageable chunks of text data
sequentially from the file and process them. We've used this approach
successfully with Lucene in the past and I'm attempting to make it work
with SOLR too. I got most of the work done yesterday, but need a bit of
guidance w.r.t. point 2.

How can I update the same field multiple times? Looking at the
SOLR source, processor.addField() merely
a. adds to the in-memory field map, and
b. attempts to write EVERYTHING to the index later on.

In my situation, (a) eventually causes a heap space error.




Here's part of the handler code.
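The actual snippet didn't survive the plain-text archive, so here is a
minimal sketch of the pattern I described above (field names and the Solr
3.x update API wiring are placeholders, not the exact code):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Read the file in fixed-size chunks and add each chunk as another value
// of the same field, instead of building one huge string in memory.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", docId);

BufferedReader in = new BufferedReader(new FileReader(textFile));
char[] buf = new char[1 << 20];          // ~1M chars per chunk
int n;
while ((n = in.read(buf)) > 0) {
    // repeated addField() makes this one multi-valued field
    doc.addField("text", new String(buf, 0, n));
}
in.close();

AddUpdateCommand cmd = new AddUpdateCommand();  // Solr 3.x signature
cmd.solrDoc = doc;
processor.processAdd(cmd);

Note that every chunk still sits in the in-memory field map until
processAdd() runs, which is exactly where the heap pressure shows up.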



Thanks much



Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by Ravish Bhagdev <ra...@gmail.com>.
Yes, I think there are good reasons why it works like that. The focus of a
search system is to be efficient on the query side, at the cost of being
less efficient on the storage side.

You must however also note that by default a field's length is limited to
10,000 tokens by the maxFieldLength setting in solrconfig.xml, which you may
also need to raise. But I guess if it's going out of memory you might have
already done this?
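For reference, the setting looks like this in a Solr 3.x solrconfig.xml
(the exact location can vary by version; raise the value, e.g. to
2147483647, to index more of each field):

<indexDefaults>
  <!-- maximum number of tokens indexed per field; 10000 by default -->
  <maxFieldLength>10000</maxFieldLength>
</indexDefaults>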

Ravish

On Wed, Apr 4, 2012 at 1:34 PM, Mikhail Khludnev <mkhludnev@griddynamics.com> wrote:

> There is https://issues.apache.org/jira/browse/LUCENE-3837 but I suppose
> it's too far from completion.
>
> On Wed, Apr 4, 2012 at 2:48 PM, Ravish Bhagdev <ravish.bhagdev@gmail.com> wrote:
> > [snip]

Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by Walter Underwood <wu...@wunderwood.org>.
I believe we are talking about two different things. The original question was about incrementally building up a field during indexing, right? 

After a document is committed, a field cannot be separately updated; that is true in both Lucene and Solr.

wunder

On Apr 4, 2012, at 12:20 PM, Yonik Seeley wrote:

> I think you're mistaken - the same limitations apply to Lucene.

Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by vybe3142 <vy...@gmail.com>.
  
Yonik Seeley wrote:
> I think you're mistaken - the same limitations apply to Lucene.

You're correct (and I stand corrected). 

I looked at our older codebase that used Lucene directly. I need to dig
deeper to understand why it doesn't crash when addField() is invoked
multiple times, once per portion of the large text data, whereas SOLR does.
According to the developer who wrote that code, we resorted to multiple
addField() invocations precisely to address the heap space issue.
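For reference, the Lucene-side pattern looked roughly like this (a sketch
from memory; field names and flags are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
BufferedReader in = new BufferedReader(new FileReader(path));
char[] buf = new char[1 << 20];
int n;
while ((n = in.read(buf)) > 0) {
    // analyzed but unstored: each chunk becomes another value of the field
    doc.add(new Field("content", new String(buf, 0, n),
                      Field.Store.NO, Field.Index.ANALYZED));
}
in.close();
writer.addDocument(doc);

Lucene also accepts new Field("content", reader) with a Reader, which
streams tokens without ever materializing the whole text; that may be the
real reason the old code never blew the heap.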

I'll post back.




Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Apr 4, 2012 at 3:14 PM, vybe3142 <vy...@gmail.com> wrote:
>
>> Updating a single field is not possible in solr.  The whole record has to
>> be rewritten.
>
> Unfortunate. Lucene allows it.

I think you're mistaken - the same limitations apply to Lucene.

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10

Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by vybe3142 <vy...@gmail.com>.
> Updating a single field is not possible in solr.  The whole record has to 
> be rewritten. 

Unfortunate. Lucene allows it.


Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
There is https://issues.apache.org/jira/browse/LUCENE-3837 but I suppose
it's too far from completion.

On Wed, Apr 4, 2012 at 2:48 PM, Ravish Bhagdev <ra...@gmail.com> wrote:

> Updating a single field is not possible in solr.  The whole record has to
> be rewritten.
>
> [snip]



-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru


Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by jmlucjav <jm...@gmail.com>.
Depending on your JVM version, -XX:+UseCompressedStrings may help alleviate
the problem. It did help me before.
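For example (illustrative start command and heap size; the flag exists on
some Sun/Oracle Java 6 builds and was removed again in Java 7):

java -Xmx1024m -XX:+UseCompressedStrings -jar start.jar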

xab


Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by vybe3142 <vy...@gmail.com>.
Thanks.

Increasing the max heap space is not a scalable option, as it reduces the
system's ability to handle multiple concurrent index requests.

The use case is indexing a set of text files which we have no control over,
i.e. they could be small or large.


Re: Incrementally updating a VERY LARGE field - Is this possible?

Posted by Ravish Bhagdev <ra...@gmail.com>.
Updating a single field is not possible in solr.  The whole record has to
be rewritten.
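That is, to change even one field you re-send the whole document under the
same uniqueKey, e.g. (illustrative field names):

<add>
  <doc>
    <field name="id">doc-42</field>
    <field name="title">same title as before</field>
    <field name="text">...the entire 300 MB of text again...</field>
  </doc>
</add>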

300 MB is still not that big a file.  Have you tried doing the indexing (if
it's only a one-time thing) by giving it ~2 GB of -Xmx heap?

A single file of that size is strange!  May I ask what it is?

Rav

On Tue, Apr 3, 2012 at 7:32 PM, vybe3142 <vy...@gmail.com> wrote:

> [snip]