Posted to solr-user@lucene.apache.org by Alexandre Rafalovitch <ar...@gmail.com> on 2012/08/21 14:41:18 UTC

Does DIH commit during large import?

Hello,

I am doing an import of large records (with large full-text fields),
and somewhere around 300,000 records DataImportHandler runs out of
heap memory on a Tika import (triggered from a custom Processor) and
rolls back. I am using stored="false" and trying some tricks to track
down possible memory leaks, but I also have a question about DIH
itself.

What actually happens when I run DIH on a large (XML source) job?
Does it accumulate some sort of state in memory that it commits at
the end? If so, can I do intermediate commits to reduce the memory
requirements? Or would it help to do several passes over the same
dataset, importing only particular entries each time? I am using the
Solr 4 (alpha) UI, so I can see some of the options there.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)

Re: Does DIH commit during large import?

Posted by Erick Erickson <er...@gmail.com>.
solrconfig.xml has a setting, ramBufferSizeMB, that limits the
memory consumed during indexing. When that limit is reached, the
in-memory buffers are flushed to the current segment. NOTE: the
segment is NOT closed; there is no implied commit here, and the data
will not be searchable until a commit happens.
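
For example, a minimal sketch of the relevant piece of solrconfig.xml
(the 100 MB value is only illustrative; on 3.x this setting lives
under <indexDefaults>/<mainIndex> rather than <indexConfig>):

  <indexConfig>
    <!-- Flush in-memory indexing buffers to the current segment
         once they reach roughly this size. This is NOT a commit. -->
    <ramBufferSizeMB>100</ramBufferSizeMB>
  </indexConfig>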

Best
Erick

On Wed, Aug 22, 2012 at 7:10 AM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:
> Thanks, I will look into autoCommit.
>
> I assume there are memory implications of not committing? Or is it
> just writing to a separate file, so it can theoretically keep going
> indefinitely?

Re: Does DIH commit during large import?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Thanks, I will look into autoCommit.

I assume there are memory implications of not committing? Or is it
just writing to a separate file, so it can theoretically keep going
indefinitely?

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Aug 22, 2012 at 2:42 AM, Lance Norskog <go...@gmail.com> wrote:
> Solr has a separate feature called 'autoCommit'. It is configured in
> solrconfig.xml. You can set Solr to commit all pending documents
> every N milliseconds or every N documents, whichever comes first. If
> you want intermediate commits during a long DIH session, you have to
> use this or write your own script that issues commits.

Re: Does DIH commit during large import?

Posted by Lance Norskog <go...@gmail.com>.
Solr has a separate feature called 'autoCommit'. It is configured in
solrconfig.xml. You can set Solr to commit all pending documents
every N milliseconds or every N documents, whichever comes first. If
you want intermediate commits during a long DIH session, you have to
use this or write your own script that issues commits.
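
A hedged sketch of what that could look like in solrconfig.xml (the
thresholds below are only examples; tune them to your import rate):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- commit after this many added documents... -->
      <maxDocs>25000</maxDocs>
      <!-- ...or after this many milliseconds, whichever comes first -->
      <maxTime>60000</maxTime>
    </autoCommit>
  </updateHandler>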

On Tue, Aug 21, 2012 at 8:48 AM, Shawn Heisey <so...@elyograg.org> wrote:
> I use Solr 3.5 and a MySQL database for import, so my setup may not
> be completely relevant, but here is my experience.
>
> Unless you turn on autoCommit in solrconfig, documents will not be
> searchable during the import. If you have "commit=true" for DIH
> (which I believe is the default), there will be a commit at the end
> of the import.
>
> It looks like there's an out-of-memory issue filed against Solr 4
> DIH with Tika that is suspected to be a bug in Tika rather than
> Solr. The issue details mention some workarounds for those who are
> familiar with Tika -- I'm not. The issue URL:
>
> https://issues.apache.org/jira/browse/SOLR-2886
>
> Thanks,
> Shawn
>



-- 
Lance Norskog
goksron@gmail.com

Re: Does DIH commit during large import?

Posted by Shawn Heisey <so...@elyograg.org>.
On 8/21/2012 6:41 AM, Alexandre Rafalovitch wrote:
> I am doing an import of large records (with large full-text fields),
> and somewhere around 300,000 records DataImportHandler runs out of
> heap memory on a Tika import (triggered from a custom Processor) and
> rolls back. I am using stored="false" and trying some tricks to track
> down possible memory leaks, but I also have a question about DIH
> itself.
>
> What actually happens when I run DIH on a large (XML source) job?
> Does it accumulate some sort of state in memory that it commits at
> the end? If so, can I do intermediate commits to reduce the memory
> requirements? Or would it help to do several passes over the same
> dataset, importing only particular entries each time? I am using the
> Solr 4 (alpha) UI, so I can see some of the options there.

I use Solr 3.5 and a MySQL database for import, so my setup may not
be completely relevant, but here is my experience.

Unless you turn on autoCommit in solrconfig, documents will not be
searchable during the import. If you have "commit=true" for DIH
(which I believe is the default), there will be a commit at the end
of the import.
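
For reference, a hedged example of a full-import request with an
explicit commit (assuming DIH is registered at /dataimport; adjust
the host and path for your own setup):

  http://localhost:8983/solr/dataimport?command=full-import&commit=true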

It looks like there's an out-of-memory issue filed against Solr 4 DIH
with Tika that is suspected to be a bug in Tika rather than Solr. The
issue details mention some workarounds for those who are familiar with
Tika -- I'm not. The issue URL:

https://issues.apache.org/jira/browse/SOLR-2886

Thanks,
Shawn