Posted to solr-user@lucene.apache.org by neosky <ne...@yahoo.com> on 2012/08/27 20:56:26 UTC

Fail to huge collection extraction

I am using Solr 3.5 and Jetty 8.12.
I need to pull out a huge query result set in one request (for example,
1 million documents, probably a couple of gigabytes in size), and my
machine has about 64 GB of memory.
I use the javabin format with SolrJ as my client, and I use a servlet to
provide a query/download service for the end user. However, when I pull
out the whole result set in one request, it fails.
solrQuery.setStart(0);
solrQuery.setRows(totalNumber); // totalNumber is sometimes 1 million
logs:
Aug 27, 2012 2:34:35 PM org.apache.solr.common.SolrException log
SEVERE: org.eclipse.jetty.io.EofException
        at
org.eclipse.jetty.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:422)
        at
org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:512)
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:159)
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:101)
        at
org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:184)
        at
org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:89)
        at
org.apache.solr.response.BinaryResponseWriter.write(BinaryResponseWriter.java:46)
        at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:336)
...

I am not sure where the bottleneck is. I tried to increase the timeouts:
solrServer.setSoTimeout(300000);
solrServer.setConnectionTimeout(3000000);
solrServer.setDefaultMaxConnectionsPerHost(100);
solrServer.setMaxTotalConnections(300);

I also tried to increase the number of cached documents in the Solr
configuration:
<queryResultMaxDocsCached>20000</queryResultMaxDocsCached>

Neither change helps. Any advice would be appreciated!

Btw: I would like to use compression, but I don't know how it works,
because after my Java client pulls out the result, I need to write it
out to the end user as a download file.
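
From what I have read so far, I think the download side with gzip would
look roughly like this (an untested sketch only; the method and variable
names are placeholders, and it assumes the browser sends
Accept-Encoding: gzip):

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.GZIPOutputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Wrap the servlet output in gzip when the browser advertises support,
// otherwise write the plain stream.
void writeDownload(HttpServletRequest req, HttpServletResponse resp,
                   Iterable<String> lines) throws IOException {
    resp.setContentType("text/plain");
    resp.setHeader("Content-Disposition", "attachment; filename=\"results.txt\"");

    String accept = req.getHeader("Accept-Encoding");
    boolean gzip = (accept != null && accept.contains("gzip"));

    Writer out;
    if (gzip) {
        resp.setHeader("Content-Encoding", "gzip");
        out = new OutputStreamWriter(new GZIPOutputStream(resp.getOutputStream()), "UTF-8");
    } else {
        out = new OutputStreamWriter(resp.getOutputStream(), "UTF-8");
    }
    try {
        for (String line : lines) {
            out.write(line);
            out.write('\n');
        }
    } finally {
        out.close(); // closing flushes the gzip trailer
    }
}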







Re: Fail to huge collection extraction

Posted by Erick Erickson <er...@gmail.com>.
Alexandre:

I'll buy you a beer sometime, it's just sooo pleasant when someone
else has the same worldview I do....

http://searchhub.org/dev/2011/11/03/stop-being-so-agreeable/

neosky:
Particularly look at the paragraph that has "the XY problem" in it.

Best
Erick


Re: Fail to huge collection extraction

Posted by neosky <ne...@yahoo.com>.
To Alex,
Thanks for your advice. I did ask, and I can understand that the
requirement is necessary for them. They won't browse all the results in
one page, but they will use the query results to do some additional
research.
So what they want is everything that exactly matches the query, and they
need to pull out the whole result set at once as a download file. I am
testing using Lucene.
Thanks very much!

To Erick,
Thanks for your information.






Re: Fail to huge collection extraction

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I am sorry, but your customer is extremely unlikely to want the whole
result in his browser. It is just a red flag that they are converting
their (business) requirements into your (IT) language and that's what
they end up with.

Go the other way, ask them to pretend that you've done it already and
then explain what happens once all those records are on their screen
(and their operating system is no longer responsive :-) ). What is the
business process that this request is for? And how often do they want to
do this (and what is the significance of that frequency)?

Do they want a weekly audit copy to make sure nobody changed the
records? Then, maybe they want a batch report emailed to them instead
(or even just generated weekly on a shared drive). Do they want
something they can access on their laptop while they are not connected
to a network? Maybe they need a local replica of the (subset of the)
app working from local index?

Perhaps you have already asked that and this is just what they want.
Then, I am afraid, you are just stuck fighting against the system
designed for other use cases. Good luck.

But if you haven't asked yet, do try! Do it often enough and you may
get a pay rise out of it, because you will be meeting your clients on
their territory instead of them having to come to yours.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)



Re: Fail to huge collection extraction

Posted by neosky <ne...@yahoo.com>.
Thanks Alex!
Yes, you hit my key points.
Actually, I have to implement both requirements.
The first one works very well, for the reason you state: I have a website
client that shows 20 records per page, and it is fast.
However, my customer also wants to use a servlet to download the whole
query result set (1 million records maximum).
So for that case, I tried to have Solr pull out 10000 or 5000 records per
page (divided into 100 or 200 queries), and then just print those records
out to the client browser.
I am not sure how the exception was generated. Is my client program (the
servlet) running out of memory, or is the connection timing out for some
reason?
The exception doesn't always happen. Sometimes it works well even when I
query 10000 records at a time across many requests, but sometimes it
crashes on only 5000 records without an obvious reason.
Your suggestion is great, but the implementation is a little complicated
for us.
Is Lucene better than Solr for this requirement? The paging in Lucene
does not seem very intuitive.
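
In case it makes the failure easier to spot, this is roughly what the
paged loop looks like on my side (a simplified, untested sketch; the URL,
query string and field name are placeholders):

import java.io.Writer;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Pull the result set in fixed-size pages and stream each page to the
// servlet writer instead of asking Solr for every row at once.
void streamAllMatches(Writer out) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr"); // placeholder URL
    int pageSize = 5000;
    int start = 0;
    long numFound = Long.MAX_VALUE;

    while (start < numFound) {
        SolrQuery q = new SolrQuery("some_field:some_value"); // placeholder query
        q.setStart(start);
        q.setRows(pageSize);

        QueryResponse rsp = solr.query(q);
        numFound = rsp.getResults().getNumFound();

        for (SolrDocument doc : rsp.getResults()) {
            out.write(String.valueOf(doc.getFieldValue("id"))); // placeholder field
            out.write('\n');
        }
        out.flush(); // push each chunk to the browser as soon as it is ready
        start += pageSize;
    }
}

One thing I do understand is that deep start offsets get more expensive
as start grows, so the later pages take longer than the early ones; that
may be related to why the larger extractions fail.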




Re: Fail to huge collection extraction

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I think the point here is the question of how you use the data.

If you want to show it to the client, then you are unlikely to need
details of more than one screenful of records (e.g. 10). When the user
goes to another screen, you rerun the query and request rows 11-20, etc.
SOLR does not have a problem rerunning complex queries and returning a
different subset of the results.

On the other hand, if you are not presenting this to the user directly
and do need all records at once, perhaps you should not be pulling all
of the record details from SOLR, but just use it for search. That is,
let SOLR return just the primary keys of the matches, and you can then
send a request to a dedicated database with the list of IDs. Databases
and their drivers are specifically designed around streaming results
without crashing or timing out. SOLR is a search system and is not
perfect as a retrieval system or primary system of record (though it
is getting there slowly).
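
As a very rough sketch of what I mean (SolrJ 3.x, untested; the field
name "id" and the query string are just placeholders for whatever your
unique key and query actually are):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Fetch one page of matches, asking SOLR for nothing but the unique key.
// The full records can then be pulled in batches from the primary data
// store (e.g. SELECT ... WHERE id IN (...)), which is built for bulk
// retrieval.
List<String> matchingIds(SolrServer solr, String queryString,
                         int start, int rows) throws Exception {
    SolrQuery q = new SolrQuery(queryString);
    q.setFields("id"); // assumes "id" is the unique key field
    q.setStart(start);
    q.setRows(rows);

    QueryResponse rsp = solr.query(q);
    List<String> ids = new ArrayList<String>();
    for (SolrDocument doc : rsp.getResults()) {
        ids.add(String.valueOf(doc.getFieldValue("id")));
    }
    return ids;
}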

Hope this helps.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)



Re: Fail to huge collection extraction

Posted by neosky <ne...@yahoo.com>.
I am sorry, but I can't quite get your point. Would you explain a little
more?
I am still struggling with this problem. It seems to crash for no
apparent reason sometimes. Even when I reduce it to 5000 records per
request it can fail, yet sometimes it works well with 10000 per page.




Re: Fail to huge collection extraction

Posted by Erick Erickson <er...@gmail.com>.
I really think you need to think about firing successive page requests
at the index and reporting in chunks.

Best
Erick
