Posted to user@spark.apache.org by Imran Rajjad <ra...@gmail.com> on 2017/10/18 05:12:56 UTC

partition by multiple columns/keys

Hi,

I have a set of rows that are a result of a groupBy(col1,col2,col3).count().

Is it possible to map the rows belonging to each unique combination inside an
iterator?

e.g.

col1 col2 col3
a      1      a1
a      1      a2
b      2      b1
b      2      b2

How can I separate the rows where (col1, col2) = (a,1) from the rows where
(col1, col2) = (b,2)?

regards,
Imran

-- 
I.R

Re: partition by multiple columns/keys

Posted by Imran Rajjad <ra...@gmail.com>.

Strangely, this works only for a very small dataset. For very large
datasets the partitioning apparently does not work. Is there a limit on the
number of columns or rows when repartitioning by multiple columns?
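
One likely explanation, offered as a sketch rather than a confirmed diagnosis:
repartition(columns) hash-partitions rows into a fixed number of partitions
(spark.sql.shuffle.partitions, 200 by default). Rows with the same key always
share a partition, but a partition is not limited to a single key. Once there
are more distinct (a, b) combinations than partitions, some partitions must
hold several combinations. The plain-Java simulation below illustrates this.
The class HashPartitionSketch, its methods, and the use of Objects.hash in
place of Spark's own hash function are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class HashPartitionSketch {

    // Stand-in for a hash partitioner: hash the key columns, then take a
    // non-negative modulo of the partition count. Objects.hash is only an
    // illustration; it is not the hash function Spark uses.
    static int partitionFor(long a, long b, int numPartitions) {
        return Math.floorMod(Objects.hash(a, b), numPartitions);
    }

    // Maps each partition index to the distinct (a, b) keys assigned to it.
    // Each row is modelled as long[]{a, b, c}.
    static Map<Integer, Set<String>> assign(List<long[]> rows, int numPartitions) {
        Map<Integer, Set<String>> byPartition = new TreeMap<>();
        for (long[] row : rows) {
            int p = partitionFor(row[0], row[1], numPartitions);
            byPartition.computeIfAbsent(p, k -> new TreeSet<>())
                       .add(row[0] + "," + row[1]);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // 100 distinct (a, b) keys spread over only 8 partitions.
        List<long[]> rows = new ArrayList<>();
        for (long a = 0; a < 10; a++) {
            for (long b = 0; b < 10; b++) {
                rows.add(new long[] {a, b, 1L});
            }
        }
        for (Map.Entry<Integer, Set<String>> e : assign(rows, 8).entrySet()) {
            System.out.println("partition " + e.getKey()
                + " holds " + e.getValue().size() + " distinct keys");
        }
    }
}
```

With 100 distinct keys and 8 partitions, at least one partition necessarily
receives more than one key, whatever hash function is used. That would match
a test that looks correct on eight rows and then appears to break on a large
dataset.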

regards,
Imran

-- 
I.R

Re: partition by multiple columns/keys

Posted by Imran Rajjad <ra...@gmail.com>.

Yes, I think I figured it out with something like the code below.

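Serializable Java Class
-----------------
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Row;

// Prints every row of a partition between two marker lines, then returns the rows unchanged.
public class MyMapPartition implements Serializable, MapPartitionsFunction<Row, Row> {
 @Override
 public Iterator<Row> call(Iterator<Row> iter) throws Exception {
  List<Row> list = new ArrayList<>();
  System.out.println("--------");   // start of one partition
  while (iter.hasNext()) {
   Row row = iter.next();
   System.out.println(row);
   list.add(row);
  }
  System.out.println(">>>>");       // end of the partition
  return list.iterator();
 }
}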

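Unit Test
-----------
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Assumes an existing SparkSession `spark` and JavaSparkContext `jsc`.
JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
    RowFactory.create(11L, 21L, 1L),
    RowFactory.create(11L, 22L, 2L),
    RowFactory.create(11L, 22L, 1L),
    RowFactory.create(12L, 23L, 3L),
    RowFactory.create(12L, 24L, 3L),
    RowFactory.create(12L, 22L, 4L),
    RowFactory.create(13L, 22L, 4L),
    RowFactory.create(14L, 22L, 4L)
));
StructType structType = new StructType()
    .add("a", DataTypes.LongType, false)
    .add("b", DataTypes.LongType, false)
    .add("c", DataTypes.LongType, false);

Dataset<Row> ds = spark.createDataFrame(rdd, structType);
ds.show();

// groupBy(...).count() appends a "count" column, so the encoder is built
// from the counted schema rather than from the three-column input schema.
Dataset<Row> counted = ds.groupBy("a", "b", "c").count();
ExpressionEncoder<Row> encoder = RowEncoder.apply(counted.schema());

// Repartition by (a, b) so that rows with the same (a, b) share a partition.
Dataset<Row> grouped = counted
    .repartition(new Column("a"), new Column("b"))
    .mapPartitions(new MyMapPartition(), encoder);

grouped.count();   // action that triggers the job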

---------------

output
--------
--------
[12,23,3,1]
>>>>
--------
[14,22,4,1]
>>>>
--------
[12,24,3,1]
>>>>
--------
[12,22,4,1]
>>>>
--------
[11,22,1,1]
[11,22,2,1]
>>>>
--------
[11,21,1,1]
>>>>
--------
[13,22,4,1]
>>>>
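
In the output above each partition happens to contain a single (a, b)
combination, but a partition can receive rows for several combinations. A
hedged sketch of one way to handle that is to bucket the rows by key inside
the per-partition function. The plain-Java example below models each row as
long[]{a, b, c, count}; the class GroupWithinPartitionSketch and the method
splitByKey are hypothetical names, not part of Spark. Spark's own
Dataset.groupByKey followed by mapGroups or flatMapGroups is another option,
since it passes the function the rows of one key at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupWithinPartitionSketch {

    // Buckets the rows seen by one partition's iterator by their (a, b) key.
    // Keys are kept in the order in which they are first encountered.
    static Map<String, List<long[]>> splitByKey(Iterator<long[]> partition) {
        Map<String, List<long[]>> groups = new LinkedHashMap<>();
        while (partition.hasNext()) {
            long[] row = partition.next();
            String key = row[0] + "," + row[1];
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // A hypothetical partition that received rows for two different keys.
        List<long[]> partition = Arrays.asList(
            new long[] {11L, 22L, 1L, 1L},
            new long[] {12L, 23L, 3L, 1L},
            new long[] {11L, 22L, 2L, 1L});
        Map<String, List<long[]>> groups = splitByKey(partition.iterator());
        for (Map.Entry<String, List<long[]>> e : groups.entrySet()) {
            System.out.println("(" + e.getKey() + ") -> "
                + e.getValue().size() + " row(s)");
        }
    }
}
```

This buffers the whole partition in memory, as the original MyMapPartition
does, so it is only suitable when a single partition fits comfortably in
memory.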


-- 
I.R

Re: partition by multiple columns/keys

Posted by ayan guha <gu...@gmail.com>.

What are you trying to achieve? I.e., are you planning to do some aggregation
on the rows grouped by c1, c2?

-- 
Best Regards,
Ayan Guha