Posted to user@spark.apache.org by ashensw <as...@wso2.com> on 2015/08/28 14:39:53 UTC

Calculating Min and Max Values using Spark Transformations?

Hi all,

I have a dataset which consists of a large number of features (columns). It is
in CSV format, so I loaded it into a Spark DataFrame. Then I converted it into a
JavaRDD<Row>, then, using a Spark transformation, into a JavaRDD<String[]>, and
then again into a JavaRDD<double[]>. So now I have a JavaRDD<double[]>. Is there
any method to calculate the max and min values of each column in this
JavaRDD<double[]>?

Or is there any way to access the array if I store the max and min values in an
array inside the Spark transformation class?

Thanks.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Calculating-Min-and-Max-Values-using-Spark-Transformations-tp24491.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Calculating Min and Max Values using Spark Transformations?

Posted by Asher Krim <ak...@hubspot.com>.
Yes, absolutely. Take a look at:
https://spark.apache.org/docs/1.4.1/mllib-statistics.html#summary-statistics
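
For the JavaRDD<double[]> you already have, a minimal Java sketch of that
approach might look like this (untested; "data" is just a placeholder name for
your existing RDD):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;

// data is assumed to be your existing JavaRDD<double[]>
JavaRDD<Vector> vectors = data.map(row -> Vectors.dense(row));

// one pass over the RDD computes all column-wise statistics
MultivariateStatisticalSummary summary = Statistics.colStats(vectors.rdd());
Vector columnMins = summary.min();   // per-column minimums
Vector columnMaxs = summary.max();   // per-column maximums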


Re: Calculating Min and Max Values using Spark Transformations?

Posted by Ashen Weerathunga <as...@wso2.com>.
Thanks everyone for the help!





-- 
*Ashen Weerathunga*
Software Engineer - Intern
WSO2 Inc.: http://wso2.com
lean.enterprise.middleware

Email: ashen@wso2.com
Mobile: +94 716042995
LinkedIn: http://lk.linkedin.com/in/ashenweerathunga

Re: Calculating Min and Max Values using Spark Transformations?

Posted by Alexey Grishchenko <pr...@gmail.com>.
If the data is already in an RDD, the easiest way to calculate min/max for
each column is the aggregate() function. It takes two functions as arguments:
the first folds RDD values into your "accumulator", the second merges two
accumulators. This way both the min and the max for all the columns in your
RDD are calculated in a single pass over it. Here's an example in Python:

import random

# seqOp: fold one row (y) into the accumulator x = [mins, maxs]
def agg1(x,y):
    if len(x) == 0: x = [y,y]
    return [map(min,zip(x[0],y)),map(max,zip(x[1],y))]

# combOp: merge two accumulators; an empty one means the partition had no rows
def agg2(x,y):
    if len(x) == 0: return y
    if len(y) == 0: return x
    return [map(min,zip(x[0],y[0])),map(max,zip(x[1],y[1]))]

rdd  = sc.parallelize(xrange(100000), 5)
rdd2 = rdd.map(lambda x: [random.randint(1,100) for _ in xrange(15)])
# returns [per-column minimums, per-column maximums] in a single pass
rdd2.aggregate([], agg1, agg2)

What I would personally do in your case depends on what else you want to do
with the data. If you plan to run more business logic on top of it and you're
more comfortable with SQL, it might be worth registering this DataFrame as a
table and generating a SQL query against it (generate a string with a series
of min-max calls). But to solve your specific problem I'd load the file with
textFile(), use a map() transformation to split each line by commas and convert
it to an array of doubles, and then call aggregate() on it, just as shown in
the example above.
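
Since your data is actually a JavaRDD<double[]>, a rough Java sketch of the
same aggregate() idea could look like this (untested; "rows" stands in for
your existing RDD):

// rows is assumed to be your existing JavaRDD<double[]>
// accumulator layout: {mins, maxs}; null means "no rows seen yet"
double[][] minMax = rows.aggregate(
    new double[][] {null, null},
    (acc, row) -> {                                  // fold one row into the accumulator
        if (acc[0] == null) {
            return new double[][] {row.clone(), row.clone()};
        }
        for (int i = 0; i < row.length; i++) {
            acc[0][i] = Math.min(acc[0][i], row[i]);
            acc[1][i] = Math.max(acc[1][i], row[i]);
        }
        return acc;
    },
    (a, b) -> {                                      // merge two partition accumulators
        if (a[0] == null) return b;
        if (b[0] == null) return a;
        for (int i = 0; i < a[0].length; i++) {
            a[0][i] = Math.min(a[0][i], b[0][i]);
            a[1][i] = Math.max(a[1][i], b[1][i]);
        }
        return a;
    });
// minMax[0] holds the per-column minimums, minMax[1] the per-column maximums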



-- 
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: ProgrammerAG@gmail.com
web:   http://0x0fff.com

Re: Calculating Min and Max Values using Spark Transformations?

Posted by Burak Yavuz <br...@gmail.com>.
Or you can just call describe() on the DataFrame. In addition to min and max,
you'll also get the mean and the count of non-null, non-NA elements.
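
A minimal sketch, assuming df is the DataFrame the CSV was originally loaded
into:

import org.apache.spark.sql.DataFrame;

// df is assumed to be the DataFrame the CSV file was loaded into
DataFrame stats = df.describe();   // count, mean, stddev, min and max for the numeric columns
stats.show();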

Burak


RE: Calculating Min and Max Values using Spark Transformations?

Posted by java8964 <ja...@hotmail.com>.
Or would RDD.max() and RDD.min() not work for you?
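
For a single column of your JavaRDD<double[]> that could look roughly like
this (note it costs one full pass over the data per column; "data" and "col"
are placeholder names):

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.util.StatCounter;

// data is assumed to be your existing JavaRDD<double[]>; col is the column index of interest
int col = 0;
JavaDoubleRDD column = data.mapToDouble(row -> row[col]);
StatCounter stats = column.stats();   // min, max, mean, count for that one column
double columnMin = stats.min();
double columnMax = stats.max();
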
Yong


Re: Calculating Min and Max Values using Spark Transformations?

Posted by Jesse F Chen <jf...@us.ibm.com>.
If you have already loaded the CSV data into a DataFrame, why not register it
as a table and use Spark SQL to find max/min or any other aggregates?
SELECT MAX(column_name) FROM dftable_name ... seems natural.
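
A minimal sketch of that, assuming df is the original DataFrame, sqlContext is
the SQLContext it was created from, and col1/col2 stand in for real column
names:

import org.apache.spark.sql.DataFrame;

// df is assumed to be the DataFrame the CSV was loaded into
df.registerTempTable("dftable_name");

// one MIN/MAX pair per column of interest; the column names are placeholders
DataFrame minMax = sqlContext.sql(
    "SELECT MIN(col1), MAX(col1), MIN(col2), MAX(col2) FROM dftable_name");
minMax.show();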

JESSE CHEN
Big Data Performance | IBM Analytics

Office:  408 463 2296
Mobile: 408 828 9068
Email:   jfchen@us.ibm.com





