Posted to solr-user@lucene.apache.org by Nicholas Ball <ni...@nodelay.com> on 2012/03/15 01:17:39 UTC

Responding to Requests with Chunks/Streaming

Hello all,

I've been working on a plugin with a custom component and a few handlers for a research project. Its aim is to do some interesting distributed work; however, I seem to have hit a roadblock when trying to respond to a client's request in multiple steps. I'm not even sure this is possible with Solr, but after no luck on the IRC channel, I thought I'd ask here.

What I'd like to achieve is for the requestHandler to return results to the user as soon as it has data available, then continue processing or performing other distributed calls, and then return some more data, all on the same single client request.

Now, my understanding is that Solr does some kind of streaming. I'm not sure how it's technically done over HTTP in Solr, so any information would be useful. I believe something like this would work well, but again I'm not sure:

http://en.m.wikipedia.org/wiki/Chunked_transfer_encoding
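
As a minimal servlet-level sketch of those mechanics (not Solr code, just how
I understand containers to behave: writing and flushing before the response is
complete makes the container drop Content-Length and switch to chunked
encoding, so the client sees each flushed piece as it is produced):

    import java.io.IOException;
    import javax.servlet.ServletOutputStream;
    import javax.servlet.http.*;

    public class ChunkedDemoServlet extends HttpServlet {
      @Override
      protected void doGet(HttpServletRequest req, HttpServletResponse resp)
          throws IOException {
        ServletOutputStream out = resp.getOutputStream();
        out.println("first batch of results");
        out.flush(); // the response is committed; this chunk is sent now
        // ... continue processing / make other distributed calls here ...
        out.println("second batch of results");
        out.flush(); // the second chunk goes out on the same request
      }
    }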

I also came across this issue/feature request in JIRA, but I'm not completely sure what the conclusion was or how someone might use it. Is it even relevant to what I'm looking for?

https://issues.apache.org/jira/browse/SOLR-578

Thank you very much for any help and time you can spare!

Nicholas (incunix)



Re: Responding to Requests with Chunks/Streaming

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Developers,

I just want to ask: don't you think that response streaming could be useful
for things like OLAP? E.g., if you have a sharded index, presorted and
pre-joined the BJQ way, you can calculate counts in many cube cells in
parallel.
The essential distributed test for response streaming just passed:
https://github.com/m-khl/solr-patches/blob/ec4db7c0422a5515392a7019c5bd23ad3f546e4b/solr/core/src/test/org/apache/solr/response/RespStreamDistributedTest.java

The branch is https://github.com/m-khl/solr-patches/tree/streaming

Regards

On Mon, Apr 2, 2012 at 10:55 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

>
> Hello,
>
> Small update: reading the streamed response is now done via a callback. No
> SolrDocumentList in memory.
> https://github.com/m-khl/solr-patches/tree/streaming
> Here is the test:
> https://github.com/m-khl/solr-patches/blob/d028d4fabe0c20cb23f16098637e2961e9e2366e/solr/core/src/test/org/apache/solr/response/ResponseStreamingTest.java#L138
>
> No progress on distributed search via streaming yet.
>
> Please let me know if you don't want updates from my playground.
>
> Regards
>
>
> On Thu, Mar 29, 2012 at 1:02 PM, Mikhail Khludnev <
> mkhludnev@griddynamics.com> wrote:
>
>> @All
>> Why does nobody desire such a pretty cool feature?
>>
>> Nicholas,
>> I have made a little progress: I'm able to stream in javabin codec format
>> while searching. It implies sorting by _docid_.
>>
>> Here is the diff:
>>
>> https://github.com/m-khl/solr-patches/commit/2f9ff068c379b3008bb983d0df69dff714ddde95
>>
>> The current issue is that SolrJ reads the response as a whole; reading via
>> callback is supported by the embedded server only. Anyway, it should not be
>> a big deal. ResponseStreamingTest.java somehow works.
>> I'm stuck on introducing response streaming in distributed search, which is
>> actually more challenging - RespStreamDistributedTest fails.
>>
>> Regards
>>
>>
>>> On Fri, Mar 16, 2012 at 3:51 PM, Nicholas Ball <nicholas.ball@nodelay.com> wrote:
>>
>>>
>>> Mikhail & Ludovic,
>>>
>>> Thanks for both your replies, very helpful indeed!
>>>
>>> Ludovic, I was actually looking into just that and did some tests with
>>> SolrJ; it does work well but needs some changes on the Solr server if we
>>> want to send out individual documents at various times. This could be done
>>> with a write() and flush() to the FastOutputStream (daos) in JavaBinCodec.
>>> I therefore think that a combination of this and Mikhail's solution would
>>> work best!
>>>
>>> Mikhail, you mention that your solution doesn't currently work and you're
>>> not sure why; could it be that you haven't flushed the data (os.flush())
>>> you've written in the collect method of DocSetStreamer? I think placing
>>> the output stream into the SolrQueryRequest is the way to go, so that we
>>> can access it and write to it as we intend. However, I think using the
>>> JavaBinCodec would be ideal so that we can work with SolrJ directly, and
>>> not mess around with the encoding of the docs/data etc...
>>>
>>> At the moment the entry point to JavaBinCodec is through the
>>> BinaryResponseWriter, which calls the top-level marshal() method that
>>> encodes and sends out the entire SolrQueryResponse (line 49 of
>>> BinaryResponseWriter). What would be ideal is to be able to break up the
>>> response and call the JavaBinCodec for pieces of it, with a flush after
>>> each call. I did a few tests with a simple Thread.sleep and a flush to see
>>> if this would actually work, and it looks like it's working out perfectly.
>>> Just trying to figure out the best way to actually do it now :) any ideas?
>>>
>>> On another note, for a solution to work with chunked transfer encoding
>>> (and therefore web browsers), a lot more development is going to be
>>> needed. Not sure if it's worth trying yet, but I might look into it later
>>> down the line.
>>>
>>> Nick
>>>
>>> On Fri, 16 Mar 2012 07:29:20 +0300, Mikhail Khludnev
>>> <mk...@griddynamics.com> wrote:
>>> > Ludovic,
>>> >
>>> > I looked through it. First of all, it seems to me you don't amend the
>>> > regular "servlet" Solr server, but only the embedded one.
>>> > Anyway, the difference is that you stream the DocList via a callback,
>>> > but that means you've instantiated it in memory and you keep it there
>>> > until it is completely consumed. Think about a billion numFound. The
>>> > core idea of my approach is to keep almost zero memory for the response.
>>> >
>>> > Regards
>>> >
>>> > On Fri, Mar 16, 2012 at 12:12 AM, lboutros <bo...@gmail.com> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> I was looking for something similar.
>>> >>
>>> >> I tried this patch :
>>> >>
>>> >> https://issues.apache.org/jira/browse/SOLR-2112
>>> >>
>>> >> It's working quite well (I've back-ported the code to Solr 3.5.0...).
>>> >>
>>> >> Is it really different from what you are trying to achieve?
>>> >>
>>> >> Ludovic.
>>> >>
>>> >> -----
>>> >> Jouve
>>> >> France.
>>> >> --
>>> >> View this message in context:
>>> >>
>>>
>>> http://lucene.472066.n3.nabble.com/Responding-to-Requests-with-Chunks-Streaming-tp3827316p3829909.html
>>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>>> >>
>>>
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> gedel@yandex.ru
>>
>> <http://www.griddynamics.com>
>>  <mk...@griddynamics.com>
>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>
>


-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Responding to Requests with Chunks/Streaming

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,

Small update: reading the streamed response is now done via a callback. No
SolrDocumentList in memory.
https://github.com/m-khl/solr-patches/tree/streaming
Here is the test:
https://github.com/m-khl/solr-patches/blob/d028d4fabe0c20cb23f16098637e2961e9e2366e/solr/core/src/test/org/apache/solr/response/ResponseStreamingTest.java#L138
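
Roughly, consuming it looks like this (a sketch in the spirit of SolrJ's
StreamingResponseCallback; the exact API on my branch may differ, and
'server' and 'process' are placeholders):

    // Each document is handed to the callback as it is read off the wire;
    // nothing is accumulated into a SolrDocumentList.
    StreamingResponseCallback callback = new StreamingResponseCallback() {
      @Override
      public void streamDocListInfo(long numFound, long start, Float maxScore) {
        System.out.println("numFound=" + numFound);
      }
      @Override
      public void streamSolrDocument(SolrDocument doc) {
        process(doc); // 'process' stands in for whatever per-document work you do
      }
    };
    server.queryAndStreamResponse(new SolrQuery("*:*"), callback);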

No progress on distributed search via streaming yet.

Please let me know if you don't want updates from my playground.

Regards

On Thu, Mar 29, 2012 at 1:02 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> @All
> Why does nobody desire such a pretty cool feature?
>
> Nicholas,
> I have made a little progress: I'm able to stream in javabin codec format
> while searching. It implies sorting by _docid_.
>
> Here is the diff:
>
> https://github.com/m-khl/solr-patches/commit/2f9ff068c379b3008bb983d0df69dff714ddde95
>
> The current issue is that SolrJ reads the response as a whole; reading via
> callback is supported by the embedded server only. Anyway, it should not be
> a big deal. ResponseStreamingTest.java somehow works.
> I'm stuck on introducing response streaming in distributed search, which is
> actually more challenging - RespStreamDistributedTest fails.
>
> Regards
>
>
> On Fri, Mar 16, 2012 at 3:51 PM, Nicholas Ball <ni...@nodelay.com> wrote:
>
>>
>> Mikhail & Ludovic,
>>
>> Thanks for both your replies, very helpful indeed!
>>
>> Ludovic, I was actually looking into just that and did some tests with
>> SolrJ; it does work well but needs some changes on the Solr server if we
>> want to send out individual documents at various times. This could be done
>> with a write() and flush() to the FastOutputStream (daos) in JavaBinCodec.
>> I therefore think that a combination of this and Mikhail's solution would
>> work best!
>>
>> Mikhail, you mention that your solution doesn't currently work and you're
>> not sure why; could it be that you haven't flushed the data (os.flush())
>> you've written in the collect method of DocSetStreamer? I think placing
>> the output stream into the SolrQueryRequest is the way to go, so that we
>> can access it and write to it as we intend. However, I think using the
>> JavaBinCodec would be ideal so that we can work with SolrJ directly, and
>> not mess around with the encoding of the docs/data etc...
>>
>> At the moment the entry point to JavaBinCodec is through the
>> BinaryResponseWriter, which calls the top-level marshal() method that
>> encodes and sends out the entire SolrQueryResponse (line 49 of
>> BinaryResponseWriter). What would be ideal is to be able to break up the
>> response and call the JavaBinCodec for pieces of it, with a flush after
>> each call. I did a few tests with a simple Thread.sleep and a flush to see
>> if this would actually work, and it looks like it's working out perfectly.
>> Just trying to figure out the best way to actually do it now :) any ideas?
>>
>> On another note, for a solution to work with chunked transfer encoding
>> (and therefore web browsers), a lot more development is going to be
>> needed. Not sure if it's worth trying yet, but I might look into it later
>> down the line.
>>
>> Nick
>>
>> On Fri, 16 Mar 2012 07:29:20 +0300, Mikhail Khludnev
>> <mk...@griddynamics.com> wrote:
>> > Ludovic,
>> >
>> > I looked through it. First of all, it seems to me you don't amend the
>> > regular "servlet" Solr server, but only the embedded one.
>> > Anyway, the difference is that you stream the DocList via a callback,
>> > but that means you've instantiated it in memory and you keep it there
>> > until it is completely consumed. Think about a billion numFound. The
>> > core idea of my approach is to keep almost zero memory for the response.
>> >
>> > Regards
>> >
>> > On Fri, Mar 16, 2012 at 12:12 AM, lboutros <bo...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I was looking for something similar.
>> >>
>> >> I tried this patch :
>> >>
>> >> https://issues.apache.org/jira/browse/SOLR-2112
>> >>
>> >> It's working quite well (I've back-ported the code to Solr 3.5.0...).
>> >>
>> >> Is it really different from what you are trying to achieve?
>> >>
>> >> Ludovic.
>> >>
>> >> -----
>> >> Jouve
>> >> France.
>> >> --
>> >> View this message in context:
>> >>
>>
>> http://lucene.472066.n3.nabble.com/Responding-to-Requests-with-Chunks-Streaming-tp3827316p3829909.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> gedel@yandex.ru
>
> <http://www.griddynamics.com>
>  <mk...@griddynamics.com>
>
>


-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Responding to Requests with Chunks/Streaming

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
@All
Why does nobody desire such a pretty cool feature?

Nicholas,
I have made a little progress: I'm able to stream in javabin codec format
while searching. It implies sorting by _docid_.
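
In other words, the client has to ask for index order; the request looks
roughly like this (sketched):

    q=*:*&sort=_docid_ asc&wt=javabin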

Here is the diff:
https://github.com/m-khl/solr-patches/commit/2f9ff068c379b3008bb983d0df69dff714ddde95

The current issue is that SolrJ reads the response as a whole; reading via
callback is supported by the embedded server only. Anyway, it should not be
a big deal. ResponseStreamingTest.java somehow works.
I'm stuck on introducing response streaming in distributed search, which is
actually more challenging - RespStreamDistributedTest fails.

Regards

On Fri, Mar 16, 2012 at 3:51 PM, Nicholas Ball <ni...@nodelay.com> wrote:

>
> Mikhail & Ludovic,
>
> Thanks for both your replies, very helpful indeed!
>
> Ludovic, I was actually looking into just that and did some tests with
> SolrJ; it does work well but needs some changes on the Solr server if we
> want to send out individual documents at various times. This could be done
> with a write() and flush() to the FastOutputStream (daos) in JavaBinCodec.
> I therefore think that a combination of this and Mikhail's solution would
> work best!
>
> Mikhail, you mention that your solution doesn't currently work and you're
> not sure why; could it be that you haven't flushed the data (os.flush())
> you've written in the collect method of DocSetStreamer? I think placing
> the output stream into the SolrQueryRequest is the way to go, so that we
> can access it and write to it as we intend. However, I think using the
> JavaBinCodec would be ideal so that we can work with SolrJ directly, and
> not mess around with the encoding of the docs/data etc...
>
> At the moment the entry point to JavaBinCodec is through the
> BinaryResponseWriter, which calls the top-level marshal() method that
> encodes and sends out the entire SolrQueryResponse (line 49 of
> BinaryResponseWriter). What would be ideal is to be able to break up the
> response and call the JavaBinCodec for pieces of it, with a flush after
> each call. I did a few tests with a simple Thread.sleep and a flush to see
> if this would actually work, and it looks like it's working out perfectly.
> Just trying to figure out the best way to actually do it now :) any ideas?
>
> On another note, for a solution to work with chunked transfer encoding
> (and therefore web browsers), a lot more development is going to be
> needed. Not sure if it's worth trying yet, but I might look into it later
> down the line.
>
> Nick
>
> On Fri, 16 Mar 2012 07:29:20 +0300, Mikhail Khludnev
> <mk...@griddynamics.com> wrote:
> > Ludovic,
> >
> > I looked through it. First of all, it seems to me you don't amend the
> > regular "servlet" Solr server, but only the embedded one.
> > Anyway, the difference is that you stream the DocList via a callback,
> > but that means you've instantiated it in memory and you keep it there
> > until it is completely consumed. Think about a billion numFound. The
> > core idea of my approach is to keep almost zero memory for the response.
> >
> > Regards
> >
> > On Fri, Mar 16, 2012 at 12:12 AM, lboutros <bo...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I was looking for something similar.
> >>
> >> I tried this patch :
> >>
> >> https://issues.apache.org/jira/browse/SOLR-2112
> >>
> >> It's working quite well (I've back-ported the code to Solr 3.5.0...).
> >>
> >> Is it really different from what you are trying to achieve?
> >>
> >> Ludovic.
> >>
> >> -----
> >> Jouve
> >> France.
> >> --
> >> View this message in context:
> >>
>
> http://lucene.472066.n3.nabble.com/Responding-to-Requests-with-Chunks-Streaming-tp3827316p3829909.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>



-- 
Sincerely yours
Mikhail Khludnev
gedel@yandex.ru

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Responding to Requests with Chunks/Streaming

Posted by Nicholas Ball <ni...@nodelay.com>.
Mikhail & Ludovic,

Thanks for both your replies, very helpful indeed!

Ludovic, I was actually looking into just that and did some tests with
SolrJ; it does work well but needs some changes on the Solr server if we
want to send out individual documents at various times. This could be done
with a write() and flush() to the FastOutputStream (daos) in JavaBinCodec. I
therefore think that a combination of this and Mikhail's solution would
work best!

Mikhail, you mention that your solution doesn't currently work and you're not
sure why; could it be that you haven't flushed the data (os.flush()) you've
written in the collect method of DocSetStreamer? I think placing the output
stream into the SolrQueryRequest is the way to go, so that we can access it
and write to it as we intend. However, I think using the JavaBinCodec would
be ideal so that we can work with SolrJ directly, and not mess around with
the encoding of the docs/data etc...

At the moment the entry point to JavaBinCodec is through the
BinaryResponseWriter, which calls the top-level marshal() method that encodes
and sends out the entire SolrQueryResponse (line 49 of BinaryResponseWriter).
What would be ideal is to be able to break up the response and call the
JavaBinCodec for pieces of it, with a flush after each call. I did a few
tests with a simple Thread.sleep and a flush to see if this would actually
work, and it looks like it's working out perfectly. Just trying to figure
out the best way to actually do it now :) any ideas?
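
For concreteness, the experiment looked roughly like this (a sketch only,
inside a method declared to throw IOException and InterruptedException;
'out' and 'pieces' are made up, and the stock marshal() is really meant to
write one complete stream, so the per-piece calls are illustrative rather
than proper API usage):

    JavaBinCodec codec = new JavaBinCodec();
    FastOutputStream daos = FastOutputStream.wrap(out); // 'out': the servlet OutputStream
    for (SolrDocument doc : pieces) {                   // 'pieces': hypothetical batches
      codec.marshal(doc, daos); // each piece was encoded separately in the test
      daos.flush();             // push what has been written to the client now
      Thread.sleep(1000);       // stand-in for further distributed work
    }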

On another note, for a solution to work with chunked transfer encoding (and
therefore web browsers), a lot more development is going to be needed. Not
sure if it's worth trying yet, but I might look into it later down the line.

Nick

On Fri, 16 Mar 2012 07:29:20 +0300, Mikhail Khludnev
<mk...@griddynamics.com> wrote:
> Ludovic,
> 
> I looked through it. First of all, it seems to me you don't amend the
> regular "servlet" Solr server, but only the embedded one.
> Anyway, the difference is that you stream the DocList via a callback, but
> that means you've instantiated it in memory and you keep it there until it
> is completely consumed. Think about a billion numFound. The core idea of my
> approach is to keep almost zero memory for the response.
> 
> Regards
> 
> On Fri, Mar 16, 2012 at 12:12 AM, lboutros <bo...@gmail.com> wrote:
> 
>> Hi,
>>
>> I was looking for something similar.
>>
>> I tried this patch :
>>
>> https://issues.apache.org/jira/browse/SOLR-2112
>>
>> It's working quite well (I've back-ported the code to Solr 3.5.0...).
>>
>> Is it really different from what you are trying to achieve?
>>
>> Ludovic.
>>
>> -----
>> Jouve
>> France.
>> --
>> View this message in context:
>>
http://lucene.472066.n3.nabble.com/Responding-to-Requests-with-Chunks-Streaming-tp3827316p3829909.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>

Re: Responding to Requests with Chunks/Streaming

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Ludovic,

I looked through it. First of all, it seems to me you don't amend the regular
"servlet" Solr server, but only the embedded one.
Anyway, the difference is that you stream the DocList via a callback, but
that means you've instantiated it in memory and you keep it there until it is
completely consumed. Think about a billion numFound. The core idea of my
approach is to keep almost zero memory for the response.

Regards

On Fri, Mar 16, 2012 at 12:12 AM, lboutros <bo...@gmail.com> wrote:

> Hi,
>
> I was looking for something similar.
>
> I tried this patch :
>
> https://issues.apache.org/jira/browse/SOLR-2112
>
> it's working quite well (I've back-ported the code in Solr 3.5.0...).
>
> Is it really different from what you are trying to achieve ?
>
> Ludovic.
>
> -----
> Jouve
> France.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Responding-to-Requests-with-Chunks-Streaming-tp3827316p3829909.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Responding to Requests with Chunks/Streaming

Posted by lboutros <bo...@gmail.com>.
Hi,

I was looking for something similar.

I tried this patch :

https://issues.apache.org/jira/browse/SOLR-2112

It's working quite well (I've back-ported the code to Solr 3.5.0...).

Is it really different from what you are trying to achieve?

Ludovic. 

-----
Jouve
France.
--
View this message in context: http://lucene.472066.n3.nabble.com/Responding-to-Requests-with-Chunks-Streaming-tp3827316p3829909.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Responding to Requests with Chunks/Streaming

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Nicholas,

Looks like we are around the same point. Here is my branch:
https://github.com/m-khl/solr-patches/tree/streaming
There are only two commits on top of it. And here is the test:
https://github.com/m-khl/solr-patches/blob/streaming/solr/core/src/test/org/apache/solr/response/ResponseStreamingTest.java
It streams increasing int ids without keeping them in heap. There is also
some unnecessary stuff with Digits(), but it's work in progress, you know.

My use case is really specific:
* Say we have documents inserted with increasing PKs; I want to search them
and retrieve results ordered by PK.
* But in this case I can get them sorted for free - they are already ordered.
* And I can also have pretty huge results, because I don't need to store them
in the response, but can just stream them into the output as they occur
during the search.

Core points:
* https://github.com/m-khl/solr-patches/blob/streaming/solr/core/src/java/org/apache/solr/servlet/ResponseStreamingRequestParsers.java
steals the servlet output stream and puts it into the request
* the Solr component adds a PostFilter to the chain, which feeds the
collected docs into DocSetStreamer
https://github.com/m-khl/solr-patches/blob/streaming/solr/core/src/java/org/apache/solr/handler/component/ResponseStreamerComponent.java
* and DocSetStreamer writes the collected docs' PKs into the output stream
(a condensed sketch follows below)
https://github.com/m-khl/solr-patches/blob/streaming/solr/core/src/java/org/apache/solr/response/DocSetStreamer.java
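
Condensed, the streamer is roughly the following (a sketch, partly
reconstructed, so take the details with a grain of salt):

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.solr.common.util.FastOutputStream;
    import org.apache.solr.search.DelegatingCollector;

    // A DelegatingCollector (the PostFilter machinery) that pushes each hit
    // straight to the response stream instead of accumulating a DocList.
    class DocSetStreamer extends DelegatingCollector {
      private final FastOutputStream out;
      private int base;

      DocSetStreamer(FastOutputStream out) { this.out = out; }

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        base = context.docBase; // remember the segment offset
        super.setNextReader(context);
      }

      @Override
      public void collect(int doc) throws IOException {
        out.writeInt(base + doc); // stream the global doc id immediately
        out.flush();              // so the container can emit a chunk right away
        super.collect(doc);
      }
    }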

It seems it can produce huge results with very little memory. It should be
damn scalable.

So, my plan is to output a response which is readable by the intrinsic
distributed search. But I'm stuck for some reason.

About chunked encoding I have a kind of common-sense consideration: I guess
there is a buffer behind the servlet output stream; when the output fits, the
container forms an HTTP response with a Content-Length, and on overflow it
switches to chunked encoding. So I believe, though I may be wrong, that this
machinery should stay behind the curtain.
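
In servlet terms, what I mean is roughly this (standard Servlet API, shown
only to illustrate the point):

    resp.setBufferSize(8192); // while the output fits in the buffer, the
                              // container can still set Content-Length itself
    resp.flushBuffer();       // committing early forces it to stream instead
                              // (chunked for HTTP/1.1 keep-alive responses)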

WDYT?


On Thu, Mar 15, 2012 at 4:17 AM, Nicholas Ball <ni...@nodelay.com> wrote:

> Hello all,
>
> I've been working on a plugin with a custom component and a few handlers
> for a research project. Its aim is to do some interesting distributed work;
> however, I seem to have hit a roadblock when trying to respond to a
> client's request in multiple steps. I'm not even sure this is possible with
> Solr, but after no luck on the IRC channel, I thought I'd ask here.
>
> What I'd like to achieve is for the requestHandler to return results to the
> user as soon as it has data available, then continue processing or
> performing other distributed calls, and then return some more data, all on
> the same single client request.
>
> Now, my understanding is that Solr does some kind of streaming. I'm not
> sure how it's technically done over HTTP in Solr, so any information would
> be useful. I believe something like this would work well, but again I'm not
> sure:
>
> http://en.m.wikipedia.org/wiki/Chunked_transfer_encoding
>
> I also came across this issue/feature request in JIRA, but I'm not
> completely sure what the conclusion was or how someone might use it. Is it
> even relevant to what I'm looking for?
>
> https://issues.apache.org/jira/browse/SOLR-578
>
> Thank you very much for any help and time you can spare!
>
> Nicholas (incunix)
>
>
>


-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Responding to Requests with Chunks/Streaming

Posted by Nicholas Ball <ni...@nodelay.com>.
Thanks for the reply, Erick.

Yep, the project is working with distributed Solr applications (i.e. shards),
but not with the Solr-supplied shard implementation; rather, a custom version
(not very different from it, to be honest).
I understand that Solr has scoring at its heart, which is something we are
not dealing with at the moment but will no doubt have to solve at some later
stage! One reasonable "solution" to this would be to return data to the user
unsorted, but as quickly as possible. The client can then decide what
elapsed-time/score trade-off it wants to work with, as speed may be more
critical than accuracy.

Clearly this is not going to be an easy change (even though I was sure it
would be, hah!), but I don't mind exposing some methods/fields to the
handler/components in the 4.x code base if that could solve my issue.
However, I really don't know where to start with this! I'm hoping that all
I will have to do is call a flush() method on an output stream SOMEWHERE
and the data already written will be sent before my next chunk of data.

Any ideas on where I should start looking for this
(classes/packages/methods) and how you might go about exposing this
functionality to a handler/component/plugin?

Thanks again,
Nick


On Thu, 15 Mar 2012 12:39:45 -0500, Erick Erickson
<er...@gmail.com> wrote:
> Somehow you'd have to create a custom collector, probably queue off the
> docs that made it to the collector and have some asynchronous thread
> consuming those docs and sending them in bits...
>
> But this is so antithetical to how Solr operates that I suspect my
> hand-waving wouldn't really work out. The problem is, at heart, that Solr
> assumes that scoring matters. The scheme you've outlined has no way of
> knowing which documents are most important until after you've already
> queued *all* documents for return. Even the queueing will take forever;
> consider the query q=*:*. You're essentially queueing all the docs in the
> index to return.
>
> If you're talking about the other sort of distributed search (e.g. shards),
> that's already built into Solr, though admittedly the aggregator (whichever
> server distributes the requests in the first place) waits for a response
> from all the shards before assembling the response. It seems like you might
> leverage some of that.
> 
> But I really don't understand the end goal very well here....
> 
> Best
> Erick
> 
> On Wed, Mar 14, 2012 at 7:17 PM, Nicholas Ball
> <ni...@nodelay.com> wrote:
>> Hello all,
>>
>> I've been working on a plugin with a custom component and a few handlers
>> for a research project. Its aim is to do some interesting distributed
>> work; however, I seem to have hit a roadblock when trying to respond to a
>> client's request in multiple steps. I'm not even sure this is possible
>> with Solr, but after no luck on the IRC channel, I thought I'd ask here.
>>
>> What I'd like to achieve is for the requestHandler to return results to
>> the user as soon as it has data available, then continue processing or
>> performing other distributed calls, and then return some more data, all on
>> the same single client request.
>>
>> Now, my understanding is that Solr does some kind of streaming. I'm not
>> sure how it's technically done over HTTP in Solr, so any information would
>> be useful. I believe something like this would work well, but again I'm
>> not sure:
>>
>> http://en.m.wikipedia.org/wiki/Chunked_transfer_encoding
>>
>> I also came across this issue/feature request in JIRA, but I'm not
>> completely sure what the conclusion was or how someone might use it. Is it
>> even relevant to what I'm looking for?
>>
>> https://issues.apache.org/jira/browse/SOLR-578
>>
>> Thank you very much for any help and time you can spare!
>>
>> Nicholas (incunix)
>>
>>

Re: Responding to Requests with Chunks/Streaming

Posted by Erick Erickson <er...@gmail.com>.
Somehow you'd have to create a custom collector, probably queue off the docs
that made it to the collector and have some asynchronous thread consuming
those docs and sending them in bits...
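
In code, the hand-waving might look something like this (every name here is
made up for illustration):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // The collector side produces doc ids into a bounded queue; a separate
    // writer thread drains the queue and flushes each bit to the client.
    final BlockingQueue<Integer> queue = new ArrayBlockingQueue<Integer>(1024);

    Thread writer = new Thread(new Runnable() {
      public void run() {
        try {
          int id;
          while ((id = queue.take()) >= 0) { // -1 serves as an end-of-stream marker
            out.println(id);                 // 'out': the response output stream
            out.flush();                     // send this bit to the client now
          }
        } catch (Exception e) {
          // client went away or search was aborted; just stop writing
        }
      }
    });
    writer.start();

    // in the custom collector:   public void collect(int doc) { queue.put(docBase + doc); }
    // when the search completes: queue.put(-1);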

But this is so antithetical to how Solr operates that I suspect my hand-waving
wouldn't really work out. The problem is, at heart, that Solr assumes that
scoring matters. The scheme you've outlined has no way of knowing which
documents are most important until after you've already queued *all*
documents for return. Even the queueing will take forever; consider
the query q=*:*. You're essentially queueing all the docs in the index
to return.

If you're talking about the other sort of distributed search (e.g. shards),
that's already built into Solr, though admittedly the aggregator (whichever
server
distributes the requests in the first place) waits for a response from all the
shards before assembling the response. It seems like you might leverage
some of that.

But I really don't understand the end goal very well here....

Best
Erick

On Wed, Mar 14, 2012 at 7:17 PM, Nicholas Ball
<ni...@nodelay.com> wrote:
> Hello all,
>
> I've been working on a plugin with a custom component and a few handlers for a research project. Its aim is to do some interesting distributed work; however, I seem to have hit a roadblock when trying to respond to a client's request in multiple steps. I'm not even sure this is possible with Solr, but after no luck on the IRC channel, I thought I'd ask here.
>
> What I'd like to achieve is for the requestHandler to return results to the user as soon as it has data available, then continue processing or performing other distributed calls, and then return some more data, all on the same single client request.
>
> Now, my understanding is that Solr does some kind of streaming. I'm not sure how it's technically done over HTTP in Solr, so any information would be useful. I believe something like this would work well, but again I'm not sure:
>
> http://en.m.wikipedia.org/wiki/Chunked_transfer_encoding
>
> I also came across this issue/feature request in JIRA, but I'm not completely sure what the conclusion was or how someone might use it. Is it even relevant to what I'm looking for?
>
> https://issues.apache.org/jira/browse/SOLR-578
>
> Thank you very much for any help and time you can spare!
>
> Nicholas (incunix)
>
>