Posted to user@spark.apache.org by satish chandra j <js...@gmail.com> on 2016/02/03 11:15:50 UTC

DataFrame First method is resulting different results in each iteration

Hi All,
I have data in emp_df (a DataFrame) as shown below:

EmpId   Sal   DeptNo
001       100   10
002       120   20
003       130   10
004       140   20
005       150   10

ordrd_emp_df = emp_df.orderBy($"DeptNo",$"Sal".desc), which gives the
result below:

DeptNo  Sal   EmpId
10         150   005
10         130   003
10         100   001
20         140   004
20         120   002

Now I want to pick the highest-paid EmpId of each DeptNo, so I applied the
"first" aggregate function as below:

ordrd_emp_df.groupBy("DeptNo").agg($"DeptNo",first("EmpId").as("TopSal")).select($"DeptNo",$"TopSal")

Expected output:

DeptNo  TopSal
10      005
20      004

But my output varies on each iteration:

First iteration:

Dept  TopSal
10    003
20    004

Second iteration:

Dept  TopSal
10    005
20    004

Third iteration:

Dept  TopSal
10    003
20    002

I am not sure why the output varies on each iteration, as there is no change
in the code or in the values in the DataFrame.

Please let me know if you have any inputs on this.

Regards,
Satish Chandra J

Re: DataFrame First method is resulting different results in each iteration

Posted by Ali Tajeldin EDU <al...@gmail.com>.

Hi Satish,
  Take a look at the smvTopNRecs() function in the SMV package.  It does exactly what you are looking for.  It might be overkill to bring in all of SMV for just one function but you will also get a lot more than just DF helper functions (modular views, higher level graphs, dynamic loading of modules (coming soon), data/code sync). Ok, end of SMV plug :-)

http://tresamigossd.github.io/SMV/scaladocs/index.html#org.tresamigos.smv.SmvGroupedDataFunc (See SmvTopNRecs function at the end).
https://github.com/TresAmigosSD/SMV : SMV github page

For your specific example,
emp_df.smvGroupBy("DeptNo").smvTopNRecs(1, $"Sal".desc)

Two things to note:
1. Use "emp_df" and not the sorted "ordrd_emp_df" as the sort will be performed by smvTopNRecs internally.
2. Must use "smvGroupBy" instead of normal "groupBy" method on DataFrame as the result of standard "groupBy" hides the original DF and grouping column :-(
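
For readers who do not want to pull in SMV, the same "top 1 record per group" selection can be sketched with Spark's built-in window functions. This is not how smvTopNRecs is implemented; it is a swapped-in technique that assumes Spark 2.x or later (SparkSession, row_number). The object name TopPaidByWindow and the helper column rank_in_dept are made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object TopPaidByWindow {
  // Keeps, for each DeptNo, the single row with the highest Sal.
  // The ordering lives inside the window, so no prior orderBy is needed.
  def topPaid(empDf: DataFrame): DataFrame = {
    val bySalDesc = Window.partitionBy(col("DeptNo")).orderBy(col("Sal").desc)
    empDf
      .withColumn("rank_in_dept", row_number().over(bySalDesc)) // hypothetical helper column
      .where(col("rank_in_dept") === 1)
      .select(col("DeptNo"), col("EmpId").as("TopSal"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .appName("TopPaidByWindow").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample data from the original post.
    val empDf = Seq(
      ("001", 100, 10), ("002", 120, 20), ("003", 130, 10),
      ("004", 140, 20), ("005", 150, 10)
    ).toDF("EmpId", "Sal", "DeptNo")

    topPaid(empDf).orderBy("DeptNo").show() // rows: (10, 005) and (20, 004)
    spark.stop()
  }
}
```

If two employees in a department share the top salary, row_number() keeps only one of them and the choice between them is arbitrary; add a tie-breaking column to the window's orderBy if that matters.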

--
Ali 


Re: DataFrame First method is resulting different results in each iteration

Posted by Hemant Bhanawat <he...@gmail.com>.

Ahh.. missed that.

I see that you have used the "first" function. 'first' returns the first row
it encounters. On a single executor it may return the right result. But on
multiple executors, it returns the first row seen by any one of the executors,
which may not be the first row once the results are combined.
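
The effect described above can be imitated with a toy model in plain Scala (no Spark involved). It is only a sketch: the split of the sorted rows into three partitions and the merge orders are assumptions chosen for illustration, and Spark's real shuffle is more involved. Each partition keeps the first EmpId it sees per DeptNo, and the partial results are then merged in whatever order they arrive.

```scala
object FirstAfterShuffle {
  final case class Emp(empId: String, sal: Int, deptNo: Int)

  // Rows already sorted by DeptNo ascending, Sal descending (as in ordrd_emp_df).
  val sorted: Seq[Emp] = Seq(
    Emp("005", 150, 10), Emp("003", 130, 10), Emp("001", 100, 10),
    Emp("004", 140, 20), Emp("002", 120, 20))

  // Hypothetical partitioning of the sorted rows: [005], [003, 001, 004], [002].
  val partitions: Seq[Seq[Emp]] =
    Seq(sorted.slice(0, 1), sorted.slice(1, 4), sorted.slice(4, 5))

  // Partial aggregation: the first EmpId seen per DeptNo within one partition.
  def firstPerDept(partition: Seq[Emp]): Map[Int, String] =
    partition.foldLeft(Map.empty[Int, String]) { (acc, e) =>
      if (acc.contains(e.deptNo)) acc else acc + (e.deptNo -> e.empId)
    }

  // Final aggregation: whichever partial result arrives first wins for a key.
  def merge(partials: Seq[Map[Int, String]]): Map[Int, String] =
    partials.foldLeft(Map.empty[Int, String]) { (acc, p) => p ++ acc }

  // Runs the model with the partitions arriving in the given order.
  def run(arrivalOrder: Seq[Int]): Map[Int, String] =
    merge(arrivalOrder.map(i => firstPerDept(partitions(i))))

  def show(result: Map[Int, String]): String =
    result.toSeq.sortBy(_._1).map { case (d, e) => s"$d -> $e" }.mkString(", ")

  def main(args: Array[String]): Unit = {
    println(show(run(Seq(0, 1, 2)))) // 10 -> 005, 20 -> 004
    println(show(run(Seq(1, 0, 2)))) // 10 -> 003, 20 -> 004
    println(show(run(Seq(2, 1, 0)))) // 10 -> 003, 20 -> 002
  }
}
```

In this model the three arrival orders give the same three outputs reported in the original post, even though the input rows and the code never change.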

I believe, if you change your query like this, you will get the right
results:

ordrd_emp_df.groupBy("DeptNo").
        agg($"DeptNo", max("Sal").as("HighestSal"))

But as you can see, this gives you the highest Sal and not the EmpId with the
highest Sal. To get the EmpId with the highest Sal, you will have to change
your query to add filters or subqueries. See the following thread:

http://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group
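
One way the "add filters" idea might look in the DataFrame API is to compute the per-department maximum and join it back to the original rows. This is a sketch under assumptions, not the query from the linked thread: it assumes Spark 2.x or later (SparkSession), and the object name TopPaidByJoin is made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max}

object TopPaidByJoin {
  // Returns, per DeptNo, the EmpId whose Sal equals the department's maximum.
  def topPaid(empDf: DataFrame): DataFrame = {
    // Step 1: deterministic aggregate, one row per department.
    val maxSal = empDf.groupBy("DeptNo").agg(max("Sal").as("HighestSal"))

    // Step 2: join back on DeptNo and keep only rows that match the maximum.
    empDf
      .join(maxSal, Seq("DeptNo"))
      .where(col("Sal") === col("HighestSal"))
      .select(col("DeptNo"), col("EmpId").as("TopSal"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .appName("TopPaidByJoin").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample data from the original post.
    val empDf = Seq(
      ("001", 100, 10), ("002", 120, 20), ("003", 130, 10),
      ("004", 140, 20), ("005", 150, 10)
    ).toDF("EmpId", "Sal", "DeptNo")

    topPaid(empDf).orderBy("DeptNo").show() // rows: (10, 005) and (20, 004)
    spark.stop()
  }
}
```

Unlike "first", max does not depend on row order, so the result is the same on every run. If several employees tie for the highest Sal in a department, this version returns all of them.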

Hemant Bhanawat
SnappyData (http://snappydata.io/)



Re: DataFrame First method is resulting different results in each iteration

Posted by satish chandra j <js...@gmail.com>.

Hi Hemant,
My DataFrame "ordrd_emp_df" holds the data in sorted order, as I applied
orderBy in the first step.
I also tried calling "orderBy" directly before "groupBy", but I still get
different results in each iteration.

Regards,
Satish Chandra



Re: DataFrame First method is resulting different results in each iteration

Posted by Hemant Bhanawat <he...@gmail.com>.

Missing order by?

Hemant Bhanawat
SnappyData (http://snappydata.io/)
