Posted to user@spark.apache.org by Imran Rajjad <ra...@gmail.com> on 2017/10/18 05:12:56 UTC
partition by multiple columns/keys
Hi,
I have a set of rows that are the result of a groupBy(col1,col2,col3).count().
Is it possible to map the rows belonging to each unique combination inside an iterator?
e.g.
col1 col2 col3
a 1 a1
a 1 a2
b 2 b1
b 2 b2
How can I separate the rows where (col1, col2) = (a,1) from those where (col1, col2) = (b,2)?
regards,
Imran
--
I.R
Re: partition by multiple columns/keys
Posted by Imran Rajjad <ra...@gmail.com>.
Strangely, this works only for a very small set of rows. For very
large datasets the partitioning apparently does not work. Is there a limit
on the number of columns or rows when repartitioning by multiple
columns?
regards,
Imran
--
I.R
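Note on the question above, offered as a likely explanation rather than a confirmed diagnosis: repartition(Column...) hash-partitions the rows into spark.sql.shuffle.partitions partitions (200 by default). Rows with the same (a, b) always land in the same partition, but a single partition can hold several distinct (a, b) combinations. With only 8 rows that seldom shows; once there are more distinct keys than partitions it has to happen. The plain-Java sketch below imitates the effect without Spark. partitionOf and maxKeysPerPartition are hypothetical helpers, and Objects.hash is only a stand-in for the hash function Spark actually uses.

```java
import java.util.Objects;

class HashPartitionDemo {
    // Stand-in for a hash partitioner: the same key always yields the same partition id.
    static int partitionOf(long a, long b, int numPartitions) {
        return Math.floorMod(Objects.hash(a, b), numPartitions);
    }

    // Largest number of distinct (a, b) keys that end up sharing one partition.
    static int maxKeysPerPartition(int distinctKeys, int numPartitions) {
        int[] keysIn = new int[numPartitions];
        int max = 0;
        for (long k = 0; k < distinctKeys; k++) {
            // (k, k % 7) is distinct for every k because the first component differs.
            int p = partitionOf(k, k % 7, numPartitions);
            keysIn[p]++;
            max = Math.max(max, keysIn[p]);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println("8 keys, 200 partitions:    " + maxKeysPerPartition(8, 200));
        System.out.println("1000 keys, 200 partitions: " + maxKeysPerPartition(1000, 200));
    }
}
```

With 1000 distinct keys and 200 partitions, at least one partition must receive five or more keys, whatever the hash function is. So there is no limit on columns or rows; the function passed to mapPartitions simply cannot assume that one partition equals one key combination.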
Re: partition by multiple columns/keys
Posted by Imran Rajjad <ra...@gmail.com>.
Yes, I think I figured out something like the below.
Serialized Java Class
-----------------
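import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Row;

// MapPartitionsFunction already extends Serializable, so it need not be listed again.
public class MyMapPartition implements MapPartitionsFunction<Row, Row> {
  @Override
  public Iterator<Row> call(Iterator<Row> iter) throws Exception {
    List<Row> list = new ArrayList<>();
    System.out.println("--------");  // marks the start of one partition
    while (iter.hasNext()) {
      Row row = iter.next();
      System.out.println(row);
      list.add(row);
    }
    System.out.println(">>>>");      // marks the end of one partition
    return list.iterator();
  }
}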
Unit Test
-----------
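import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// jsc is a JavaSparkContext and spark a SparkSession, both created elsewhere.
JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
    RowFactory.create(11L, 21L, 1L),
    RowFactory.create(11L, 22L, 2L),
    RowFactory.create(11L, 22L, 1L),
    RowFactory.create(12L, 23L, 3L),
    RowFactory.create(12L, 24L, 3L),
    RowFactory.create(12L, 22L, 4L),
    RowFactory.create(13L, 22L, 4L),
    RowFactory.create(14L, 22L, 4L)
));
StructType structType = new StructType()
    .add("a", DataTypes.LongType, false)
    .add("b", DataTypes.LongType, false)
    .add("c", DataTypes.LongType, false);
Dataset<Row> ds = spark.createDataFrame(rdd, structType);
ds.show();

// count() adds a fourth column, so the output encoder is built from the
// counted schema rather than from the three-column input schema.
Dataset<Row> counted = ds.groupBy("a", "b", "c").count();
ExpressionEncoder<Row> encoder = RowEncoder.apply(counted.schema());

MyMapPartition mp = new MyMapPartition();
Dataset<Row> grouped = counted
    .repartition(new Column("a"), new Column("b"))
    .mapPartitions(mp, encoder);
grouped.count();  // action that triggers the job and the printing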
---------------
output
--------
--------
[12,23,3,1]
>>>>
--------
[14,22,4,1]
>>>>
--------
[12,24,3,1]
>>>>
--------
[12,22,4,1]
>>>>
--------
[11,22,1,1]
[11,22,2,1]
>>>>
--------
[11,21,1,1]
>>>>
--------
[13,22,4,1]
>>>>
--
I.R
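Note on the approach above: because one partition may contain more than one (a, b) combination, a function that needs exactly one group at a time can split the partition's rows by key itself. The following is a minimal plain-Java sketch of that idea, not the poster's code. PartitionGrouping and groupByKey are hypothetical names, and a List<Long> stands in for a Spark Row so that the example runs without Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionGrouping {
    // Splits the rows of one partition into groups keyed by their first keyWidth columns.
    // Groups are kept in the order in which their keys are first seen.
    static Map<List<Long>, List<List<Long>>> groupByKey(Iterator<List<Long>> rows, int keyWidth) {
        Map<List<Long>, List<List<Long>>> groups = new LinkedHashMap<>();
        while (rows.hasNext()) {
            List<Long> row = rows.next();
            List<Long> key = new ArrayList<>(row.subList(0, keyWidth));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // One partition that happens to hold two different (a, b) keys.
        List<List<Long>> partition = Arrays.asList(
            Arrays.asList(11L, 22L, 1L, 1L),
            Arrays.asList(12L, 23L, 3L, 1L),
            Arrays.asList(11L, 22L, 2L, 1L));
        groupByKey(partition.iterator(), 2)
            .forEach((key, rows) -> System.out.println(key + " -> " + rows));
    }
}
```

This buffers a whole partition in memory, as the original MyMapPartition does. Within Spark itself, Dataset.groupByKey followed by mapGroups or flatMapGroups hands the function one key and its rows at a time, which may be a closer fit for the original question.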
Re: partition by multiple columns/keys
Posted by ayan guha <gu...@gmail.com>.
What do you want to achieve, and how? I.e., are you planning to do some
aggregation on the groups by c1,c2?
--
Best Regards,
Ayan Guha