Posted to user@spark.apache.org by Gen <ge...@gmail.com> on 2014/10/20 11:52:47 UTC

Re: How to aggregate data in Apache Spark

Hi,

I will write the code in Python:

{code:title=test.py}
from operator import add

data = sc.textFile(...).map(...)  ## Please make sure that the rdd looks
                                  ## like [[id, c1, c2, c3], [id, c1, c2, c3], ...]
keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3])))
keypair = keypair.reduceByKey(add)
out = keypair.map(lambda l: list(l[0]) + [l[1]])
{code}
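
For the test.csv layout in the question quoted below, the map step could, for example, split on commas and strip the surrounding whitespace. A minimal end-to-end sketch (assuming the file really is plain comma-separated text with no header row):

{code:title=aggregate_sketch.py}
from operator import add

# parse each line such as "A1 , c1 , c2 ,2" into [id, C1, C2, C3]
data = sc.textFile("test.csv").map(lambda line: [f.strip() for f in line.split(",")])

# key by (id, C1, C2), sum the C3 values, then flatten each pair back into a row
keypair = data.map(lambda l: ((l[0], l[1], l[2]), float(l[3])))
out = keypair.reduceByKey(add).map(lambda kv: list(kv[0]) + [kv[1]])

print(out.collect())
# e.g. [['A1', 'c1', 'c2', 3.0], ['A1', 'c11', 'c2', 1.0], ...]
{code}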


Kalyan wrote
> I have a distributed system of 3 nodes and my dataset is distributed among
> those nodes. For example, I have a test.csv file which exists on all 3
> nodes and contains 4 columns:
> 
> row   | id,  C1, C2,  C3
> ----------------------
> row1  | A1 , c1 , c2 ,2
> row2  | A1 , c1 , c2 ,1 
> row3  | A1 , c11, c2 ,1 
> row4  | A2 , c1 , c2 ,1 
> row5  | A2 , c1 , c2 ,1 
> row6  | A2 , c11, c2 ,1 
> row7  | A2 , c11, c21,1 
> row8  | A3 , c1 , c2 ,1
> row9  | A3 , c1 , c2 ,2
> row10 | A4 , c1 , c2 ,1
> 
> I need help: how do I aggregate the data set by the id, C1, C2 columns
> (summing C3) so that the output looks like this?
> 
> row   | id,  C1, C2,  C3
> ----------------------
> row1  | A1 , c1 , c2 ,3
> row2  | A1 , c11, c2 ,1 
> row3  | A2 , c1 , c2 ,2 
> row4  | A2 , c11, c2 ,1 
> row5  | A2 , c11, c21,1 
> row6  | A3 , c1 , c2 ,3
> row7  | A4 , c1 , c2 ,1
> 
> Thanks 
> Kalyan





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-aggregate-data-in-Apach-Spark-tp16764p16803.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to aggregate data in Apache Spark

Posted by Davies Liu <da...@databricks.com>.
You could also use Spark SQL:

from pyspark.sql import Row, SQLContext

row = Row('id', 'C1', 'C2', 'C3')
# convert each line into a list of fields
data = sc.textFile("test.csv").map(lambda line: line.split(','))
sqlContext = SQLContext(sc)
rows = data.map(lambda r: row(*r))
sqlContext.inferSchema(rows).registerTempTable("data")
result = sqlContext.sql(
    "select id, C1, C2, sum(C3) from data group by id, C1, C2")  # result is a SchemaRDD


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org