Posted to solr-user@lucene.apache.org by Floyd Wu <fl...@gmail.com> on 2014/03/31 07:00:12 UTC

how to index 20 MB plain-text xml

I have many plain-text XML files that I transform into Solr's XML update format.
But every time I send them to Solr, I hit an OOM exception.
How can I configure Solr to "eat" these big XML files?
Please point me in the right direction. Thanks

floyd

Re: how to index 20 MB plain-text xml

Posted by pr...@policija.si.
Hi!

I had the same issue with XML files - even small XML files produced OOM 
exceptions. I read that the way XML is parsed can sometimes blow up memory 
requirements to the point where Java runs out of heap: a DOM-style parser 
keeps the whole document tree in memory, so a file can easily expand to many 
times its on-disk size. My solutions were:

1. Don't parse XML files at all.
2. Parse only small XML files and hope for the best.
3. Give Solr the largest possible Java heap size (and hope for the best).

But then again, one time I also got an OOM exception with Word documents - it 
turned out that a user had pasted 400 MB worth of photos into a Word 
file.

Regards,

Primoz

Re: how to index 20 MB plain-text xml

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
If you have an application in front of Solr, why are you sending it raw XML
documents? Can't you convert them to another format and then send them in
batches? Even if it stays XML, just bite the bullet and send it in
100-document batches, or in smaller batches combined with the auto-commit
settings I mentioned earlier.
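
As an illustration of that batching pattern, here is a minimal sketch using
the SolrJ 4.x-era HttpSolrServer API; the URL, core name, field names, and
input loop are assumptions for illustration, not from the thread:

    // Minimal batching sketch (SolrJ 4.x). Core name and fields are illustrative.
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchIndexer {
        private static final int BATCH_SIZE = 100;

        public static void main(String[] args) throws Exception {
            HttpSolrServer solr =
                    new HttpSolrServer("http://localhost:8983/solr/collection1");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 10000; i++) {       // stand-in for the real input loop
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("text", "body of document " + i);
                batch.add(doc);
                if (batch.size() == BATCH_SIZE) {   // send 100 documents at a time
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);                    // flush the remainder
            }
            solr.commit();                          // one explicit commit at the end
            solr.shutdown();
        }
    }

Sending small batches keeps each HTTP request and Solr's per-request buffering
bounded, regardless of how large the overall input is.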

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

Re: how to index 20 MB plain-text xml

Posted by Floyd Wu <fl...@gmail.com>.
Hi Upayavira,
Users don't hit Solr directly; they search documents through my application.
The application is an entry point for users to upload documents, which are
then indexed by Solr. The situation is that they upload a plain-text file,
something like a dictionary - and a dictionary can be quite big.
I'm trying to figure out a good technique for splitting these XML files into
smaller ones and streaming them to Solr.

Floyd

Re: how to index 20 MB plain-text xml

Posted by Upayavira <uv...@odoko.co.uk>.
Tell the user they can't have it!

Or, write a small app that reads in their XML in one go and pushes it
in parts to Solr. Generally, I'd say letting a user hit Solr directly is
a bad thing - especially a user who doesn't know the details of how Solr
works.
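
A rough sketch of such an app, assuming the upload is already in Solr's
<add><doc>...</doc></add> format (the file name, URL, and batch size are
assumptions): it streams the file with StAX, so the whole document never
sits in memory, and re-posts the <doc> elements 100 at a time.

    // Hypothetical splitter: streams a large Solr <add> file with StAX and
    // re-posts its <doc> elements in batches, so nothing huge is buffered.
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.events.XMLEvent;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.StringWriter;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class XmlSplitter {
        public static void main(String[] args) throws Exception {
            XMLEventReader in = XMLInputFactory.newFactory()
                    .createXMLEventReader(new FileInputStream("big.xml"));
            StringWriter buf = new StringWriter();
            XMLEventWriter out = null;
            int docs = 0;
            while (in.hasNext()) {
                XMLEvent ev = in.nextEvent();
                if (ev.isStartElement() && "doc".equals(
                        ev.asStartElement().getName().getLocalPart())) {
                    out = XMLOutputFactory.newFactory().createXMLEventWriter(buf);
                }
                if (out != null) {
                    out.add(ev);                       // copy events inside <doc>...</doc>
                }
                if (ev.isEndElement() && "doc".equals(
                        ev.asEndElement().getName().getLocalPart())) {
                    out.close();
                    out = null;
                    if (++docs % 100 == 0) {           // flush every 100 documents
                        post("<add>" + buf + "</add>");
                        buf.getBuffer().setLength(0);
                    }
                }
            }
            if (buf.getBuffer().length() > 0) {
                post("<add>" + buf + "</add>");        // flush the remainder
            }
        }

        static void post(String xml) throws IOException {
            HttpURLConnection c = (HttpURLConnection) new URL(
                    "http://localhost:8983/solr/update?commit=true").openConnection();
            c.setDoOutput(true);
            c.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            c.getOutputStream().write(xml.getBytes("UTF-8"));
            c.getInputStream().close();               // drain the response
        }
    }

Committing on every batch is crude; in practice you would post without
commit=true and rely on auto-commit, or commit once at the end.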

Upayavira


Re: how to index 20 MB plain-text xml

Posted by Floyd Wu <fl...@gmail.com>.
Hi Alex,

Thanks for your response. Personally I don't want to feed these big XML
files to Solr, but the users want it.
I'll try your suggestions later.

Many thanks.

Floyd

Re: how to index 20 MB plain-text xml

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Without digging too deep into why exactly this is happening, here are
the general options:

0. Are you actually committing? Check the messages in the logs and see
if the records show up when you expect them to.
1. Are you actually sending the 20 MB file to Solr in one request? Maybe
it's the HTTP buffer that's blowing up. Try using stream.file instead
(note the security warning, though): http://wiki.apache.org/solr/ContentStream
2. Split the file into smaller ones and commit each separately.
3. Set a hard auto-commit in solrconfig.xml based on the number of documents,
to flush in-memory structures to disk.
4. Switch to using DataImportHandler to pull the XML instead of pushing it.
5. Increase the amount of memory given to Solr (the -Xmx command-line flag).
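
For option 3, a sketch of what the hard auto-commit section of
solrconfig.xml might look like; the thresholds are illustrative, not from
the thread:

    <!-- solrconfig.xml: flush in-memory structures to disk automatically -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>1000</maxDocs>            <!-- hard-commit after this many docs -->
        <maxTime>60000</maxTime>           <!-- ...or after 60 s, whichever comes first -->
        <openSearcher>false</openSearcher> <!-- commit without reopening a searcher -->
      </autoCommit>
    </updateHandler>

For option 1, remote streaming has to be enabled in solrconfig.xml
(enableRemoteStreaming="true" on <requestParsers>); a request like
/update?stream.file=/full/path/big.xml then makes Solr read the file from
its own disk instead of receiving it over HTTP.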

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
