Posted to user@spark.apache.org by Martin Le <ma...@gmail.com> on 2016/07/29 14:57:25 UTC

sampling operation for DStream

Hi all,

I have to handle a high-rate data stream. To reduce the heavy load, I
want to apply sampling to each stream window, i.e. process a subset of
the data instead of the whole window. I saw that Spark supports
sampling operations on RDDs; does Spark support a sampling operation
for DStreams as well? If not, could you please give me a suggestion on
how to implement one?

Thanks,
Martin

Re: sampling operation for DStream

Posted by Cody Koeninger <co...@koeninger.org>.
Put the queue in a static variable that is first referenced on the
workers (inside an RDD closure). That way it will be created on each
of the workers, not on the driver.

The easiest way to do that is with a lazy val in a companion object.
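
A minimal sketch of that pattern (the object name, element type, and
capacity below are illustrative, not code from this thread):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // The lazy val is initialized on first reference. Scala objects are
    // loaded per JVM rather than serialized with the closure, so that
    // first reference happens on each executor, and each executor gets
    // its own queue; nothing accumulates on the driver.
    // (Note: not synchronized; concurrent tasks on one executor would
    // need a thread-safe structure.)
    object ReservoirHolder {
      val capacity = 1000
      lazy val reservoir: mutable.Queue[String] = mutable.Queue.empty
    }

    def sampleBatch(rdd: RDD[String]): RDD[String] =
      rdd.mapPartitions { iter =>
        val q = ReservoirHolder.reservoir // first touched on the worker
        iter.filter { item =>
          // placeholder admission rule; real reservoir sampling would
          // evict a random element once the queue is full
          if (q.size < ReservoirHolder.capacity) { q.enqueue(item); true }
          else false
        }
      }

Calling sampleBatch from inside stream.transform { ... } keeps the
first reference to the object on the workers.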

On Mon, Aug 1, 2016 at 3:22 PM, Martin Le <ma...@gmail.com> wrote:
> How do I do that? If I put the queue inside the .transform operation,
> it doesn't work.
>
> On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> Can you keep a queue per executor in memory?
>>
>> On Mon, Aug 1, 2016 at 11:24 AM, Martin Le <ma...@gmail.com>
>> wrote:
>> > Hi Cody and all,
>> >
>> > Thank you for your answer. I implemented simple random sampling
>> > (SRS) for DStreams using the transform method, and it works fine.
>> > However, I have a problem when implementing reservoir sampling (RS).
>> > In RS, I need to maintain a reservoir (a queue) to store the
>> > selected data items (RDDs). If I define a large stream window, the
>> > queue keeps growing, and that leads to the driver running out of
>> > memory. I explain my problem in detail here:
>> >
>> > https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>> >
>> > Could you please give me some suggestions or advice on how to fix
>> > this problem?
>> >
>> > Thanks
>> >
>> > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <co...@koeninger.org>
>> > wrote:
>> >>
>> With most streaming systems you're still going to incur the cost of
>> reading each message... I suppose you could rotate among reading just
>> the latest messages from a single partition of a Kafka topic if they
>> were evenly balanced.
>>
>> But once you've read the messages, nothing's stopping you from
>> filtering most of them out before doing further processing. The
>> DStream .transform method will let you do any filtering / sampling
>> you could have done on an RDD.
>> >>
>> >> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <ma...@gmail.com>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I have to handle a high-rate data stream. To reduce the heavy
>> >> > load, I want to apply sampling to each stream window, i.e.
>> >> > process a subset of the data instead of the whole window. I saw
>> >> > that Spark supports sampling operations on RDDs; does Spark
>> >> > support a sampling operation for DStreams as well? If not, could
>> >> > you please give me a suggestion on how to implement one?
>> >> >
>> >> > Thanks,
>> >> > Martin
>> >
>> >
>
>

Re: sampling operation for DStream

Posted by Martin Le <ma...@gmail.com>.
How do I do that? If I put the queue inside the .transform operation,
it doesn't work.
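
For context, the pattern that fails looks roughly like this (a
hypothetical reconstruction, not code from this thread):

    import scala.collection.mutable
    import org.apache.spark.streaming.dstream.DStream

    val reservoir = mutable.Queue.empty[String] // created on the driver

    def broken(stream: DStream[String]): DStream[String] =
      stream.transform { rdd =>
        rdd.filter { item =>
          // The filter closure captures `reservoir`, so Spark ships a
          // serialized copy of the queue to the executors with every
          // task; each task mutates its own copy, and those updates
          // are never merged back or seen by the next batch.
          reservoir.enqueue(item)
          reservoir.size <= 1000
        }
      }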

On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger <co...@koeninger.org> wrote:

> Can you keep a queue per executor in memory?
>
> On Mon, Aug 1, 2016 at 11:24 AM, Martin Le <ma...@gmail.com>
> wrote:
> > Hi Cody and all,
> >
> > Thank you for your answer. I implemented simple random sampling
> > (SRS) for DStreams using the transform method, and it works fine.
> > However, I have a problem when implementing reservoir sampling (RS).
> > In RS, I need to maintain a reservoir (a queue) to store the selected
> > data items (RDDs). If I define a large stream window, the queue keeps
> > growing, and that leads to the driver running out of memory. I
> > explain my problem in detail here:
> >
> > https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
> >
> > Could you please give me some suggestions or advice on how to fix
> > this problem?
> >
> > Thanks
> >
> > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> With most streaming systems you're still going to incur the cost of
> >> reading each message... I suppose you could rotate among reading
> >> just the latest messages from a single partition of a Kafka topic if
> >> they were evenly balanced.
> >>
> >> But once you've read the messages, nothing's stopping you from
> >> filtering most of them out before doing further processing. The
> >> DStream .transform method will let you do any filtering / sampling
> >> you could have done on an RDD.
> >>
> >> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <ma...@gmail.com>
> >> wrote:
> >> > Hi all,
> >> >
> >> > I have to handle a high-rate data stream. To reduce the heavy
> >> > load, I want to apply sampling to each stream window, i.e. process
> >> > a subset of the data instead of the whole window. I saw that Spark
> >> > supports sampling operations on RDDs; does Spark support a
> >> > sampling operation for DStreams as well? If not, could you please
> >> > give me a suggestion on how to implement one?
> >> >
> >> > Thanks,
> >> > Martin
> >
> >
>

Re: sampling operation for DStream

Posted by Cody Koeninger <co...@koeninger.org>.
Can you keep a queue per executor in memory?

On Mon, Aug 1, 2016 at 11:24 AM, Martin Le <ma...@gmail.com> wrote:
> Hi Cody and all,
>
> Thank you for your answer. I implemented simple random sampling (SRS)
> for DStreams using the transform method, and it works fine.
> However, I have a problem when implementing reservoir sampling (RS).
> In RS, I need to maintain a reservoir (a queue) to store the selected
> data items (RDDs). If I define a large stream window, the queue keeps
> growing, and that leads to the driver running out of memory. I explain
> my problem in detail here:
> https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>
> Could you please give me some suggestions or advice on how to fix this
> problem?
>
> Thanks
>
> On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> With most streaming systems you're still going to incur the cost of
>> reading each message... I suppose you could rotate among reading just
>> the latest messages from a single partition of a Kafka topic if they
>> were evenly balanced.
>>
>> But once you've read the messages, nothing's stopping you from
>> filtering most of them out before doing further processing. The
>> DStream .transform method will let you do any filtering / sampling
>> you could have done on an RDD.
>>
>> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <ma...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I have to handle a high-rate data stream. To reduce the heavy load,
>> > I want to apply sampling to each stream window, i.e. process a
>> > subset of the data instead of the whole window. I saw that Spark
>> > supports sampling operations on RDDs; does Spark support a sampling
>> > operation for DStreams as well? If not, could you please give me a
>> > suggestion on how to implement one?
>> >
>> > Thanks,
>> > Martin
>
>

Re: sampling operation for DStream

Posted by Martin Le <ma...@gmail.com>.
Hi Cody and all,

Thank you for your answer. I implemented simple random sampling (SRS)
for DStreams using the transform method, and it works fine.
However, I have a problem when implementing reservoir sampling (RS). In
RS, I need to maintain a reservoir (a queue) to store the selected data
items (RDDs). If I define a large stream window, the queue keeps
growing, and that leads to the driver running out of memory. I explain
my problem in detail here:
https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
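
For reference, RS here is the classic Algorithm R: keep a fixed-size,
uniformly random sample of a stream. A minimal local sketch (plain
Scala, no Spark; k and the element type are illustrative):

    import scala.collection.mutable.ArrayBuffer
    import scala.util.Random

    // After n items have been seen, each of them is in the reservoir
    // with probability k / n; the reservoir never exceeds k elements.
    def reservoirSample[T](items: Iterator[T], k: Int): Seq[T] = {
      val reservoir = new ArrayBuffer[T](k)
      var n = 0L
      for (item <- items) {
        n += 1
        if (reservoir.size < k) reservoir += item
        else {
          val j = (Random.nextDouble() * n).toLong // uniform in [0, n)
          if (j < k) reservoir(j.toInt) = item     // replace w.p. k / n
        }
      }
      reservoir.toSeq
    }

A reservoir of this form stays bounded by k regardless of the window
size, so the growth described above points at where the queue lives
(the driver) and what it stores (RDDs), not at the sampling logic.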

Could you please give me some suggestions or advice on how to fix this
problem?

Thanks

On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger <co...@koeninger.org> wrote:

> With most streaming systems you're still going to incur the cost of
> reading each message... I suppose you could rotate among reading just
> the latest messages from a single partition of a Kafka topic if they
> were evenly balanced.
>
> But once you've read the messages, nothing's stopping you from
> filtering most of them out before doing further processing. The
> DStream .transform method will let you do any filtering / sampling
> you could have done on an RDD.
>
> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <ma...@gmail.com>
> wrote:
> > Hi all,
> >
> > I have to handle a high-rate data stream. To reduce the heavy load,
> > I want to apply sampling to each stream window, i.e. process a
> > subset of the data instead of the whole window. I saw that Spark
> > supports sampling operations on RDDs; does Spark support a sampling
> > operation for DStreams as well? If not, could you please give me a
> > suggestion on how to implement one?
> >
> > Thanks,
> > Martin
>

Re: sampling operation for DStream

Posted by Cody Koeninger <co...@koeninger.org>.
With most streaming systems you're still going to incur the cost of
reading each message... I suppose you could rotate among reading just
the latest messages from a single partition of a Kafka topic if they
were evenly balanced.

But once you've read the messages, nothing's stopping you from
filtering most of them out before doing further processing. The
DStream .transform method will let you do any filtering / sampling you
could have done on an RDD.
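
A concrete version of that last point (a sketch; the stream name and
the 1% fraction are arbitrary):

    import org.apache.spark.streaming.dstream.DStream

    // Simple random sampling of each batch via RDD.sample inside
    // transform; apply it after window() to sample each stream window.
    def sampled(stream: DStream[String]): DStream[String] =
      stream.transform(rdd =>
        rdd.sample(withReplacement = false, fraction = 0.01))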

On Fri, Jul 29, 2016 at 9:57 AM, Martin Le <ma...@gmail.com> wrote:
> Hi all,
>
> I have to handle a high-rate data stream. To reduce the heavy load, I
> want to apply sampling to each stream window, i.e. process a subset of
> the data instead of the whole window. I saw that Spark supports
> sampling operations on RDDs; does Spark support a sampling operation
> for DStreams as well? If not, could you please give me a suggestion on
> how to implement one?
>
> Thanks,
> Martin
