Posted to dev@spark.apache.org by RJ Nowling <rn...@gmail.com> on 2015/06/30 20:01:57 UTC

Grouping runs of elements in an RDD

Hi all,

I have a problem where I have an RDD of elements:

Item1 Item2 Item3 Item4 Item5 Item6 ...

and I want to run a function over them to decide which runs of elements to
group together:

[Item1 Item2] [Item3] [Item4 Item5 Item6] ...

Technically, I could use aggregate to do this, but I would have to use a
List of List of T which would produce a very large collection in memory.

Is there an easy way to accomplish this?  E.g., it would be nice to have a
version of aggregate where the combination function can return a complete
group that is added to the new RDD and an incomplete group which is passed
to the next call of the reduce function.

Thanks,
RJ
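
For illustration, a minimal Scala sketch of the kind of run-grouping described
above, written against a plain Iterator so that only the run currently being
built is held in memory. The helper name groupRuns and the break predicate are
hypothetical, not an existing Spark or Scala API:

    import scala.collection.mutable.ListBuffer

    // Group consecutive items into runs. break(prev, next) returns true when
    // `next` should start a new run. Completed runs are emitted lazily; only
    // the run currently being built is buffered.
    def groupRuns[T](items: Iterator[T])(break: (T, T) => Boolean): Iterator[List[T]] =
      new Iterator[List[T]] {
        private val in = items.buffered
        def hasNext: Boolean = in.hasNext
        def next(): List[T] = {
          val run = ListBuffer(in.next())
          while (in.hasNext && !break(run.last, in.head))
            run += in.next()
          run.toList
        }
      }

    // Example: start a new run when the gap between successive values exceeds 1.
    // groupRuns(Iterator(1, 2, 5, 6, 7, 20))((a, b) => b - a > 1).toList
    //   => List(List(1, 2), List(5, 6, 7), List(20))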

Re: Grouping runs of elements in an RDD

Posted by RJ Nowling <rn...@gmail.com>.
Thanks, Mohit.  It sounds like we're on the same page -- I used a similar
approach.

On Thu, Jul 2, 2015 at 12:27 PM, Mohit Jaggi <mo...@gmail.com> wrote:

> If you are joining successive lines together based on a predicate, then
> you are doing a "flatMap", not an "aggregate". You are on the right track
> with a multi-pass solution. I had the same challenge when I needed a
> sliding window over an RDD (see below).
>
> [ I had suggested that the sliding window API be moved to spark-core; not
> sure if that happened ]
>
> ----- previous posts ---
>
>
> http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions
>
> > On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi <mo...@gmail.com>
> > wrote:
> >
> >
> > http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3CCALRVTpKN65rOLzbETC+Ddk4O+YJm+TfAF5DZ8EuCpL-2YHYPZA@mail.gmail.com%3E
> >
> > You can use the MLlib function or do the following (which is what I had
> > done):
> >
> > - In the first pass over the data, using mapPartitionsWithIndex, gather the
> > first item in each partition. You can use collect (or an aggregator) for this.
> > “Key” them by the partition index. At the end, you will have a map
> >    (partition index) --> first item
> > - In the second pass over the data, using mapPartitionsWithIndex again,
> > look at two items at a time (or, in the general case, N items at a time;
> > you can use Scala’s sliding iterator) and check the time difference (or any
> > sliding window computation). To this mapPartitions call, pass the map created
> > in the previous step. You will need it to check the last item in this
> > partition.
> >
> > If you can tolerate a few inaccuracies then you can just do the second
> > step. You will miss the “boundaries” of the partitions but it might be
> > acceptable for your use case.
>
>
> On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling <rn...@gmail.com> wrote:
>
>> That's an interesting idea!  I hadn't considered that.  However, looking
>> at the Partitioner interface, I would need to make the decision from a
>> single key, which doesn't fit my case, unfortunately.  For my case, I need to
>> compare successive pairs of keys.  (I'm trying to re-join lines that were
>> split prematurely.)
>>
>> On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh <
>> abhishsi@tetrationanalytics.com> wrote:
>>
>>> could you use a custom partitioner to preserve boundaries such that all
>>> related tuples end up on the same partition?
>>>
>>> On Jun 30, 2015, at 12:00 PM, RJ Nowling <rn...@gmail.com> wrote:
>>>
>>> Thanks, Reynold.  I still need to handle incomplete groups that fall
>>> between partition boundaries. So, I need a two-pass approach. I came up
>>> with a somewhat hacky way to handle those using the partition indices and
>>> key-value pairs as a second pass after the first.
>>>
>>> OCaml's standard library provides a function called group() that takes a
>>> break function that operates on pairs of successive elements.  It seems a
>>> similar approach could be used in Spark and would be more efficient than my
>>> approach with key-value pairs since you know the ordering of the partitions.
>>>
>>> Has this need been expressed by others?
>>>
>>> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>>> Try mapPartitions, which gives you an iterator, and you can produce an
>>>> iterator back.
>>>>
>>>>
>>>> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rn...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a problem where I have an RDD of elements:
>>>>>
>>>>> Item1 Item2 Item3 Item4 Item5 Item6 ...
>>>>>
>>>>> and I want to run a function over them to decide which runs of
>>>>> elements to group together:
>>>>>
>>>>> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>>>>>
>>>>> Technically, I could use aggregate to do this, but I would have to use
>>>>> a List of List of T which would produce a very large collection in memory.
>>>>>
>>>>> Is there an easy way to accomplish this?  E.g., it would be nice to
>>>>> have a version of aggregate where the combination function can return a
>>>>> complete group that is added to the new RDD and an incomplete group which
>>>>> is passed to the next call of the reduce function.
>>>>>
>>>>> Thanks,
>>>>> RJ
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: Grouping runs of elements in an RDD

Posted by Mohit Jaggi <mo...@gmail.com>.
If you are joining successive lines together based on a predicate, then you
are doing a "flatMap", not an "aggregate". You are on the right track with a
multi-pass solution. I had the same challenge when I needed a sliding
window over an RDD (see below).

[ I had suggested that the sliding window API be moved to spark-core; not
sure if that happened ]

----- previous posts ---

http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions

> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi <mo...@gmail.com>
> wrote:
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3CCALRVTpKN65rOLzbETC+Ddk4O+YJm+TfAF5DZ8EuCpL-2YHYPZA@mail.gmail.com%3E
>
> You can use the MLlib function or do the following (which is what I had
> done):
>
> - In the first pass over the data, using mapPartitionsWithIndex, gather the
> first item in each partition. You can use collect (or an aggregator) for this.
> “Key” them by the partition index. At the end, you will have a map
>    (partition index) --> first item
> - In the second pass over the data, using mapPartitionsWithIndex again,
> look at two items at a time (or, in the general case, N items at a time;
> you can use Scala’s sliding iterator) and check the time difference (or any
> sliding window computation). To this mapPartitions call, pass the map created
> in the previous step. You will need it to check the last item in this
> partition.
>
> If you can tolerate a few inaccuracies then you can just do the second
> step. You will miss the “boundaries” of the partitions but it might be
> acceptable for your use case.
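
A rough Scala sketch of the two-pass idea quoted above (the helper name
successivePairs is hypothetical, not an established API; this is one way to do
it). The first pass collects the first element of every partition, keyed by
partition index; the second pass appends the head of the next non-empty
partition to each partition's iterator, so the pair that straddles the
partition boundary can be examined as well:

    import org.apache.spark.rdd.RDD

    def successivePairs[T](rdd: RDD[T]): RDD[(T, T)] = {
      // Pass 1: first element of each non-empty partition, keyed by partition index.
      val heads: Map[Int, T] = rdd
        .mapPartitionsWithIndex { (idx, iter) =>
          if (iter.hasNext) Iterator((idx, iter.next())) else Iterator.empty
        }
        .collect()
        .toMap
      val headsBc = rdd.sparkContext.broadcast(heads)

      // Pass 2: slide over each partition's elements plus the head of the next
      // non-empty partition, emitting successive pairs (including the pair that
      // crosses the partition boundary).
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        val later = headsBc.value.filter { case (i, _) => i > idx }
        val nextHead =
          if (later.isEmpty) Iterator.empty else Iterator(later.minBy(_._1)._2)
        (iter ++ nextHead).sliding(2).collect { case Seq(a, b) => (a, b) }
      }
    }

If the inaccuracy mentioned above is acceptable, the first pass and the
broadcast can be dropped and only the per-partition sliding remains.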


On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling <rn...@gmail.com> wrote:

> That's an interesting idea!  I hadn't considered that.  However, looking
> at the Partitioner interface, I would need to make the decision from a
> single key, which doesn't fit my case, unfortunately.  For my case, I need to
> compare successive pairs of keys.  (I'm trying to re-join lines that were
> split prematurely.)
>
> On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh <
> abhishsi@tetrationanalytics.com> wrote:
>
>> could you use a custom partitioner to preserve boundaries such that all
>> related tuples end up on the same partition?
>>
>> On Jun 30, 2015, at 12:00 PM, RJ Nowling <rn...@gmail.com> wrote:
>>
>> Thanks, Reynold.  I still need to handle incomplete groups that fall
>> between partition boundaries. So, I need a two-pass approach. I came up
>> with a somewhat hacky way to handle those using the partition indices and
>> key-value pairs as a second pass after the first.
>>
>> OCaml's standard library provides a function called group() that takes a
>> break function that operates on pairs of successive elements.  It seems a
>> similar approach could be used in Spark and would be more efficient than my
>> approach with key-value pairs since you know the ordering of the partitions.
>>
>> Has this need been expressed by others?
>>
>> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Try mapPartitions, which gives you an iterator, and you can produce an
>>> iterator back.
>>>
>>>
>>> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rn...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a problem where I have an RDD of elements:
>>>>
>>>> Item1 Item2 Item3 Item4 Item5 Item6 ...
>>>>
>>>> and I want to run a function over them to decide which runs of elements
>>>> to group together:
>>>>
>>>> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>>>>
>>>> Technically, I could use aggregate to do this, but I would have to use
>>>> a List of List of T which would produce a very large collection in memory.
>>>>
>>>> Is there an easy way to accomplish this?  E.g., it would be nice to
>>>> have a version of aggregate where the combination function can return a
>>>> complete group that is added to the new RDD and an incomplete group which
>>>> is passed to the next call of the reduce function.
>>>>
>>>> Thanks,
>>>> RJ
>>>>
>>>
>>>
>>
>>
>

Re: Grouping runs of elements in an RDD

Posted by RJ Nowling <rn...@gmail.com>.
That's an interesting idea!  I hadn't considered that.  However, looking at
the Partitioner interface, I would need to make the decision from a single
key, which doesn't fit my case, unfortunately.  For my case, I need to
compare successive pairs of keys.  (I'm trying to re-join lines that were
split prematurely.)

On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh <
abhishsi@tetrationanalytics.com> wrote:

> could you use a custom partitioner to preserve boundaries such that all
> related tuples end up on the same partition?
>
> On Jun 30, 2015, at 12:00 PM, RJ Nowling <rn...@gmail.com> wrote:
>
> Thanks, Reynold.  I still need to handle incomplete groups that fall
> between partition boundaries. So, I need a two-pass approach. I came up
> with a somewhat hacky way to handle those using the partition indices and
> key-value pairs as a second pass after the first.
>
> OCaml's standard library provides a function called group() that takes a
> break function that operates on pairs of successive elements.  It seems a
> similar approach could be used in Spark and would be more efficient than my
> approach with key-value pairs since you know the ordering of the partitions.
>
> Has this need been expressed by others?
>
> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Try mapPartitions, which gives you an iterator, and you can produce an
>> iterator back.
>>
>>
>> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rn...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have a problem where I have an RDD of elements:
>>>
>>> Item1 Item2 Item3 Item4 Item5 Item6 ...
>>>
>>> and I want to run a function over them to decide which runs of elements
>>> to group together:
>>>
>>> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>>>
>>> Technically, I could use aggregate to do this, but I would have to use a
>>> List of List of T which would produce a very large collection in memory.
>>>
>>> Is there an easy way to accomplish this?  E.g., it would be nice to
>>> have a version of aggregate where the combination function can return a
>>> complete group that is added to the new RDD and an incomplete group which
>>> is passed to the next call of the reduce function.
>>>
>>> Thanks,
>>> RJ
>>>
>>
>>
>
>

Re: Grouping runs of elements in an RDD

Posted by "Abhishek R. Singh" <ab...@tetrationanalytics.com>.
Could you use a custom partitioner to preserve boundaries such that all related tuples end up on the same partition?
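
For concreteness, a minimal Scala sketch of what such a partitioner could look
like, under the hypothetical assumption that each record is keyed by a line
number and that every block of blockSize consecutive lines is known up front
to belong together. As RJ points out in his reply above, getPartition only
ever sees a single key, so this only fits cases where group membership can be
computed from one key in isolation:

    import org.apache.spark.Partitioner

    // Hypothetical example: route every block of `blockSize` consecutive line
    // numbers to the same partition. Usable on a pair RDD via
    // pairs.partitionBy(new RunBlockPartitioner(numParts = 8, blockSize = 1000L)).
    class RunBlockPartitioner(numParts: Int, blockSize: Long) extends Partitioner {
      override def numPartitions: Int = numParts

      // Spark hands this method one key at a time; neighbouring keys are not
      // visible here.
      override def getPartition(key: Any): Int = {
        val lineNo = key.asInstanceOf[Long]
        ((lineNo / blockSize) % numParts).toInt
      }
    }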

On Jun 30, 2015, at 12:00 PM, RJ Nowling <rn...@gmail.com> wrote:

> Thanks, Reynold.  I still need to handle incomplete groups that fall between partition boundaries. So, I need a two-pass approach. I came up with a somewhat hacky way to handle those using the partition indices and key-value pairs as a second pass after the first.
> 
> OCaml's standard library provides a function called group() that takes a break function that operates on pairs of successive elements.  It seems a similar approach could be used in Spark and would be more efficient than my approach with key-value pairs since you know the ordering of the partitions.
> 
> Has this need been expressed by others?  
> 
> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <rx...@databricks.com> wrote:
> Try mapPartitions, which gives you an iterator, and you can produce an iterator back.
> 
> 
> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rn...@gmail.com> wrote:
> Hi all,
> 
> I have a problem where I have an RDD of elements:
> 
> Item1 Item2 Item3 Item4 Item5 Item6 ...
> 
> and I want to run a function over them to decide which runs of elements to group together:
> 
> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
> 
> Technically, I could use aggregate to do this, but I would have to use a List of List of T which would produce a very large collection in memory.
> 
> Is there an easy way to accomplish this?  E.g., it would be nice to have a version of aggregate where the combination function can return a complete group that is added to the new RDD and an incomplete group which is passed to the next call of the reduce function.
> 
> Thanks,
> RJ
> 
> 


Re: Grouping runs of elements in an RDD

Posted by RJ Nowling <rn...@gmail.com>.
Thanks, Reynold.  I still need to handle incomplete groups that fall
between partition boundaries. So, I need a two-pass approach. I came up
with a somewhat hacky way to handle those using the partition indices and
key-value pairs as a second pass after the first.

OCaml's standard library provides a function called group() that takes a
break function that operates on pairs of successive elements.  It seems a
similar approach could be used in Spark and would be more efficient than my
approach with key-value pairs since you know the ordering of the partitions.

Has this need been expressed by others?

On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <rx...@databricks.com> wrote:

> Try mapPartitions, which gives you an iterator, and you can produce an
> iterator back.
>
>
> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rn...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a problem where I have an RDD of elements:
>>
>> Item1 Item2 Item3 Item4 Item5 Item6 ...
>>
>> and I want to run a function over them to decide which runs of elements
>> to group together:
>>
>> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>>
>> Technically, I could use aggregate to do this, but I would have to use a
>> List of List of T which would produce a very large collection in memory.
>>
>> Is there an easy way to accomplish this?  E.g., it would be nice to have
>> a version of aggregate where the combination function can return a complete
>> group that is added to the new RDD and an incomplete group which is passed
>> to the next call of the reduce function.
>>
>> Thanks,
>> RJ
>>
>
>

Re: Grouping runs of elements in an RDD

Posted by Reynold Xin <rx...@databricks.com>.
Try mapPartitions, which gives you an iterator, and you can produce an
iterator back.
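
A short usage sketch of this suggestion, reusing the hypothetical groupRuns
helper sketched near the top of this thread (sc is an existing SparkContext).
Note that each partition is grouped independently, so runs that straddle a
partition boundary come out split; that is the gap the two-pass approach
discussed earlier in the thread addresses:

    import org.apache.spark.rdd.RDD

    // Start a new run whenever the gap between successive values exceeds 1.
    val rdd: RDD[Int] = sc.parallelize(Seq(1, 2, 5, 6, 7, 20), numSlices = 2)
    val runs: RDD[List[Int]] =
      rdd.mapPartitions(iter => groupRuns(iter)((a, b) => b - a > 1))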


On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rn...@gmail.com> wrote:

> Hi all,
>
> I have a problem where I have an RDD of elements:
>
> Item1 Item2 Item3 Item4 Item5 Item6 ...
>
> and I want to run a function over them to decide which runs of elements to
> group together:
>
> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>
> Technically, I could use aggregate to do this, but I would have to use a
> List of List of T which would produce a very large collection in memory.
>
> Is there an easy way to accomplish this?  E.g., it would be nice to have
> a version of aggregate where the combination function can return a complete
> group that is added to the new RDD and an incomplete group which is passed
> to the next call of the reduce function.
>
> Thanks,
> RJ
>
