Posted to user@spark.apache.org by kushagra deep <ku...@gmail.com> on 2021/05/12 12:50:00 UTC

Merge two dataframes

Hi All,

I have two dataframes

df1

amount_6m
 100
 200
 300
 400
 500

And a second dataframe df2 below

 amount_9m
  500
  600
  700
  800
  900

The number of rows is the same in both dataframes.

Can I merge the two dataframes to achieve the df3 below?

df3

amount_6m | amount_9m
      100 |       500
      200 |       600
      300 |       700
      400 |       800
      500 |       900

Thanks in advance

Reg,
Kushagra Deep

Re: Merge two dataframes

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Kushagra,


I believe you are referring to this warning below

WARN window.WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation.

I don't know an easy way around it. If the operation is only done once you
may be able to live with it. Let me think about what else can be done.

HTH

Mich


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 18 May 2021 at 20:51, kushagra deep <ku...@gmail.com>
wrote:

> Thanks  a lot Mich , this works though I have to test for scalability.
> I have one question though . If we dont specify any column in partitionBy
> will it shuffle all the records in one executor ? Because this is what
> seems to be happening.
>
>
> Thanks once again !
> Regards
> Kushagra Deep
>
> On Tue, May 18, 2021 at 10:48 PM Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> Ok, this should hopefully work as it uses row_number.
>>
>> from pyspark.sql.window import Window
>> import pyspark.sql.functions as F
>> from pyspark.sql.functions import row_number
>>
>> def spark_session(appName):
>>   return SparkSession.builder \
>>         .appName(appName) \
>>         .enableHiveSupport() \
>>         .getOrCreate()
>> appName = "test"
>> spark =spark_session(appName)
>> ##
>> ## Get a DF first from csv files
>> ##
>> d1location="hdfs://rhes75:9000/tmp/df1.csv"
>> d2location="hdfs://rhes75:9000/tmp/df2.csv"
>>
>> df1 = spark.read.csv(d1location, header="true")
>> df1.printSchema()
>> df1.show()
>> df2 = spark.read.csv(d2location, header="true")
>> df2.printSchema()
>> df2.show()
>> df1 =
>> df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"),"amount_6m")
>> df1.show()
>> df2 =
>> df2.select(F.row_number().over(Window.partitionBy().orderBy(df2['amount_9m'])).alias("row_num"),"amount_9m")
>> df2.show()
>> df1.join(df2,"row_num","inner").select("amount_6m","amount_9m").show()
>>
>>
>> root
>>  |-- amount_6m: string (nullable = true)
>>
>> +---------+
>> |amount_6m|
>> +---------+
>> |      100|
>> |      200|
>> |      300|
>> |      400|
>> |      500 |
>> +---------+
>>
>> root
>>  |-- amount_9m: string (nullable = true)
>>
>> +---------+
>> |amount_9m|
>> +---------+
>> |      500|
>> |      600|
>> |      700|
>> |      800|
>> |      900|
>> +---------+
>>
>> +-------+---------+
>> |row_num|amount_6m|
>> +-------+---------+
>> |      1|      100|
>> |      2|      200|
>> |      3|      300|
>> |      4|      400|
>> |      5|      500 |
>> +-------+---------+
>>
>> +-------+---------+
>> |row_num|amount_9m|
>> +-------+---------+
>> |      1|      500|
>> |      2|      600|
>> |      3|      700|
>> |      4|      800|
>> |      5|      900|
>> +-------+---------+
>>
>> +---------+---------+
>> |amount_6m|amount_9m|
>> +---------+---------+
>> |      100|      500|
>> |      200|      600|
>> |      300|      700|
>> |      400|      800|
>> |     500 |      900|
>> +---------+---------+
>>
>> HTH
>>
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 18 May 2021 at 16:39, kushagra deep <ku...@gmail.com>
>> wrote:
>>
>>> The use case is to calculate PSI/CSI values . And yes the union is one
>>> to one row as you showed.
>>>
>>> On Tue, May 18, 2021, 20:39 Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Hi Kushagra,
>>>>
>>>> A bit late on this but what is the business use case for this merge?
>>>>
>>>> You have two data frames each with one column and you want to UNION
>>>> them in a certain way but the correlation is not known. In other words this
>>>> UNION is as is?
>>>>
>>>>        amount_6m | amount_9m
>>>>        100             500
>>>>        200             600
>>>>
>>>> HTH
>>>>
>>>>
>>>> On Wed, 12 May 2021 at 13:51, kushagra deep <ku...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have two dataframes
>>>>>
>>>>> df1
>>>>>
>>>>> amount_6m
>>>>>  100
>>>>>  200
>>>>>  300
>>>>>  400
>>>>>  500
>>>>>
>>>>> And a second data df2 below
>>>>>
>>>>>  amount_9m
>>>>>   500
>>>>>   600
>>>>>   700
>>>>>   800
>>>>>   900
>>>>>
>>>>> The number of rows is same in both dataframes.
>>>>>
>>>>> Can I merge the two dataframes to achieve below df
>>>>>
>>>>> df3
>>>>>
>>>>> amount_6m | amount_9m
>>>>>     100                   500
>>>>>      200                  600
>>>>>      300                  700
>>>>>      400                  800
>>>>>      500                  900
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Reg,
>>>>> Kushagra Deep
>>>>>
>>>>>

Re: Merge two dataframes

Posted by kushagra deep <ku...@gmail.com>.
Thanks a lot Mich, this works, though I have to test it for scalability.
I have one question though: if we don't specify any column in partitionBy,
will it shuffle all the records to one executor? Because this is what
seems to be happening.


Thanks once again !
Regards
Kushagra Deep

On Tue, May 18, 2021 at 10:48 PM Mich Talebzadeh <mi...@gmail.com>
wrote:

> Ok, this should hopefully work as it uses row_number.
>
> from pyspark.sql.window import Window
> import pyspark.sql.functions as F
> from pyspark.sql.functions import row_number
>
> def spark_session(appName):
>   return SparkSession.builder \
>         .appName(appName) \
>         .enableHiveSupport() \
>         .getOrCreate()
> appName = "test"
> spark =spark_session(appName)
> ##
> ## Get a DF first from csv files
> ##
> d1location="hdfs://rhes75:9000/tmp/df1.csv"
> d2location="hdfs://rhes75:9000/tmp/df2.csv"
>
> df1 = spark.read.csv(d1location, header="true")
> df1.printSchema()
> df1.show()
> df2 = spark.read.csv(d2location, header="true")
> df2.printSchema()
> df2.show()
> df1 =
> df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"),"amount_6m")
> df1.show()
> df2 =
> df2.select(F.row_number().over(Window.partitionBy().orderBy(df2['amount_9m'])).alias("row_num"),"amount_9m")
> df2.show()
> df1.join(df2,"row_num","inner").select("amount_6m","amount_9m").show()
>
>
> root
>  |-- amount_6m: string (nullable = true)
>
> +---------+
> |amount_6m|
> +---------+
> |      100|
> |      200|
> |      300|
> |      400|
> |      500 |
> +---------+
>
> root
>  |-- amount_9m: string (nullable = true)
>
> +---------+
> |amount_9m|
> +---------+
> |      500|
> |      600|
> |      700|
> |      800|
> |      900|
> +---------+
>
> +-------+---------+
> |row_num|amount_6m|
> +-------+---------+
> |      1|      100|
> |      2|      200|
> |      3|      300|
> |      4|      400|
> |      5|      500 |
> +-------+---------+
>
> +-------+---------+
> |row_num|amount_9m|
> +-------+---------+
> |      1|      500|
> |      2|      600|
> |      3|      700|
> |      4|      800|
> |      5|      900|
> +-------+---------+
>
> +---------+---------+
> |amount_6m|amount_9m|
> +---------+---------+
> |      100|      500|
> |      200|      600|
> |      300|      700|
> |      400|      800|
> |     500 |      900|
> +---------+---------+
>
> HTH
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 18 May 2021 at 16:39, kushagra deep <ku...@gmail.com>
> wrote:
>
>> The use case is to calculate PSI/CSI values . And yes the union is one to
>> one row as you showed.
>>
>> On Tue, May 18, 2021, 20:39 Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>>>
>>> Hi Kushagra,
>>>
>>> A bit late on this but what is the business use case for this merge?
>>>
>>> You have two data frames each with one column and you want to UNION them
>>> in a certain way but the correlation is not known. In other words this
>>> UNION is as is?
>>>
>>>        amount_6m | amount_9m
>>>        100             500
>>>        200             600
>>>
>>> HTH
>>>
>>>
>>> On Wed, 12 May 2021 at 13:51, kushagra deep <ku...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have two dataframes
>>>>
>>>> df1
>>>>
>>>> amount_6m
>>>>  100
>>>>  200
>>>>  300
>>>>  400
>>>>  500
>>>>
>>>> And a second data df2 below
>>>>
>>>>  amount_9m
>>>>   500
>>>>   600
>>>>   700
>>>>   800
>>>>   900
>>>>
>>>> The number of rows is same in both dataframes.
>>>>
>>>> Can I merge the two dataframes to achieve below df
>>>>
>>>> df3
>>>>
>>>> amount_6m | amount_9m
>>>>     100                   500
>>>>      200                  600
>>>>      300                  700
>>>>      400                  800
>>>>      500                  900
>>>>
>>>> Thanks in advance
>>>>
>>>> Reg,
>>>> Kushagra Deep
>>>>
>>>>

Re: Merge two dataframes

Posted by Mich Talebzadeh <mi...@gmail.com>.
Ok, this should hopefully work as it uses row_number.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

def spark_session(appName):
    return SparkSession.builder \
        .appName(appName) \
        .enableHiveSupport() \
        .getOrCreate()

appName = "test"
spark = spark_session(appName)
##
## Get a DF first from csv files
##
d1location = "hdfs://rhes75:9000/tmp/df1.csv"
d2location = "hdfs://rhes75:9000/tmp/df2.csv"

df1 = spark.read.csv(d1location, header="true")
df1.printSchema()
df1.show()
df2 = spark.read.csv(d2location, header="true")
df2.printSchema()
df2.show()
# Number the rows in each DataFrame; an empty partitionBy() means a single
# window over all rows, which is what triggers the "No Partition Defined" warning
df1 = df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"), "amount_6m")
df1.show()
df2 = df2.select(F.row_number().over(Window.partitionBy().orderBy(df2['amount_9m'])).alias("row_num"), "amount_9m")
df2.show()
# Join positionally on the generated row number
df1.join(df2, "row_num", "inner").select("amount_6m", "amount_9m").show()


root
 |-- amount_6m: string (nullable = true)

+---------+
|amount_6m|
+---------+
|      100|
|      200|
|      300|
|      400|
|      500 |
+---------+

root
 |-- amount_9m: string (nullable = true)

+---------+
|amount_9m|
+---------+
|      500|
|      600|
|      700|
|      800|
|      900|
+---------+

+-------+---------+
|row_num|amount_6m|
+-------+---------+
|      1|      100|
|      2|      200|
|      3|      300|
|      4|      400|
|      5|      500 |
+-------+---------+

+-------+---------+
|row_num|amount_9m|
+-------+---------+
|      1|      500|
|      2|      600|
|      3|      700|
|      4|      800|
|      5|      900|
+-------+---------+

+---------+---------+
|amount_6m|amount_9m|
+---------+---------+
|      100|      500|
|      200|      600|
|      300|      700|
|      400|      800|
|     500 |      900|
+---------+---------+

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 18 May 2021 at 16:39, kushagra deep <ku...@gmail.com>
wrote:

> The use case is to calculate PSI/CSI values . And yes the union is one to
> one row as you showed.
>
> On Tue, May 18, 2021, 20:39 Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>>
>> Hi Kushagra,
>>
>> A bit late on this but what is the business use case for this merge?
>>
>> You have two data frames each with one column and you want to UNION them
>> in a certain way but the correlation is not known. In other words this
>> UNION is as is?
>>
>>        amount_6m | amount_9m
>>        100             500
>>        200             600
>>
>> HTH
>>
>>
>> On Wed, 12 May 2021 at 13:51, kushagra deep <ku...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I have two dataframes
>>>
>>> df1
>>>
>>> amount_6m
>>>  100
>>>  200
>>>  300
>>>  400
>>>  500
>>>
>>> And a second data df2 below
>>>
>>>  amount_9m
>>>   500
>>>   600
>>>   700
>>>   800
>>>   900
>>>
>>> The number of rows is same in both dataframes.
>>>
>>> Can I merge the two dataframes to achieve below df
>>>
>>> df3
>>>
>>> amount_6m | amount_9m
>>>     100                   500
>>>      200                  600
>>>      300                  700
>>>      400                  800
>>>      500                  900
>>>
>>> Thanks in advance
>>>
>>> Reg,
>>> Kushagra Deep
>>>
>>>

Re: Merge two dataframes

Posted by kushagra deep <ku...@gmail.com>.
The use case is to calculate PSI/CSI values. And yes, the union is one-to-one
by row, as you showed.

On Tue, May 18, 2021, 20:39 Mich Talebzadeh <mi...@gmail.com>
wrote:

>
> Hi Kushagra,
>
> A bit late on this but what is the business use case for this merge?
>
> You have two data frames each with one column and you want to UNION them
> in a certain way but the correlation is not known. In other words this
> UNION is as is?
>
>        amount_6m | amount_9m
>        100             500
>        200             600
>
> HTH
>
>
> On Wed, 12 May 2021 at 13:51, kushagra deep <ku...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have two dataframes
>>
>> df1
>>
>> amount_6m
>>  100
>>  200
>>  300
>>  400
>>  500
>>
>> And a second data df2 below
>>
>>  amount_9m
>>   500
>>   600
>>   700
>>   800
>>   900
>>
>> The number of rows is same in both dataframes.
>>
>> Can I merge the two dataframes to achieve below df
>>
>> df3
>>
>> amount_6m | amount_9m
>>     100                   500
>>      200                  600
>>      300                  700
>>      400                  800
>>      500                  900
>>
>> Thanks in advance
>>
>> Reg,
>> Kushagra Deep
>>
>>

Re: Merge two dataframes

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Kushagra,

A bit late on this but what is the business use case for this merge?

You have two data frames, each with one column, and you want to UNION them in
a certain way, but the correlation is not known. In other words, this UNION
is as-is?

       amount_6m | amount_9m
       100             500
       200             600

HTH


On Wed, 12 May 2021 at 13:51, kushagra deep <ku...@gmail.com>
wrote:

> Hi All,
>
> I have two dataframes
>
> df1
>
> amount_6m
>  100
>  200
>  300
>  400
>  500
>
> And a second data df2 below
>
>  amount_9m
>   500
>   600
>   700
>   800
>   900
>
> The number of rows is same in both dataframes.
>
> Can I merge the two dataframes to achieve below df
>
> df3
>
> amount_6m | amount_9m
>     100                   500
>      200                  600
>      300                  700
>      400                  800
>      500                  900
>
> Thanks in advance
>
> Reg,
> Kushagra Deep
>
>

Re: Merge two dataframes

Posted by ayan guha <gu...@gmail.com>.
Hi Kushagra

I still think this is a bad idea. By definition, data in a dataframe or RDD
is unordered; you are imposing an order where there is none, and if it works
it will be by chance. For example, a simple repartition may disrupt the row
ordering. It is just too unpredictable.

I would suggest you fix this upstream and add a correct identifier to each of
the streams. It will for sure be a much better solution.

On Wed, 19 May 2021 at 7:21 pm, Mich Talebzadeh <mi...@gmail.com>
wrote:

> That generation of row_number() has to be performed through a window call
> and I don't think there is any way around it without orderBy()
>
> df1 =
> df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"),"amount_6m")
>
> The problem is that without partitionBy() clause data will be skewed
> towards one executor.
>
> WARN window.WindowExec: No Partition Defined for Window operation! Moving
> all data to a single partition, this can cause serious performance
> degradation.
>
> Cheers
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 12 May 2021 at 17:33, Andrew Melo <an...@gmail.com> wrote:
>
>> Hi,
>>
>> In the case where the left and right hand side share a common parent like:
>>
>> df = spark.read.someDataframe().withColumn('rownum', row_number())
>> df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
>> df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
>> df_joined = df1.join(df2, 'rownum', 'inner')
>>
>> (or maybe replacing row_number() with monotonically_increasing_id()....)
>>
>> Is there some hint/optimization that can be done to let Spark know
>> that the left and right hand-sides of the join share the same
>> ordering, and a sort/hash merge doesn't need to be done?
>>
>> Thanks
>> Andrew
>>
>> On Wed, May 12, 2021 at 11:07 AM Sean Owen <sr...@gmail.com> wrote:
>> >
>> > Yeah I don't think that's going to work - you aren't guaranteed to get
>> 1, 2, 3, etc. I think row_number() might be what you need to generate a
>> join ID.
>> >
>> > RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not.
>> You could .zip two RDDs you get from DataFrames and manually convert the
>> Rows back to a single Row and back to DataFrame.
>> >
>> >
>> > On Wed, May 12, 2021 at 10:47 AM kushagra deep <
>> kushagra94deep@gmail.com> wrote:
>> >>
>> >> Thanks Raghvendra
>> >>
>> >> Will the ids for corresponding columns  be same always ? Since
>> monotonic_increasing_id() returns a number based on partitionId and the row
>> number of the partition  ,will it be same for corresponding columns? Also
>> is it guaranteed that the two dataframes will be divided into logical spark
>> partitions with the same cardinality for each partition ?
>> >>
>> >> Reg,
>> >> Kushagra Deep
>> >>
>> >> On Wed, May 12, 2021, 21:00 Raghavendra Ganesh <
>> raghavendra.g@gmail.com> wrote:
>> >>>
>> >>> You can add an extra id column and perform an inner join.
>> >>>
>> >>> val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
>> >>>
>> >>> val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
>> >>>
>> >>> df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
>> >>>
>> >>> +---------+---------+
>> >>>
>> >>> |amount_6m|amount_9m|
>> >>>
>> >>> +---------+---------+
>> >>>
>> >>> |      100|      500|
>> >>>
>> >>> |      200|      600|
>> >>>
>> >>> |      300|      700|
>> >>>
>> >>> |      400|      800|
>> >>>
>> >>> |      500|      900|
>> >>>
>> >>> +---------+---------+
>> >>>
>> >>>
>> >>> --
>> >>> Raghavendra
>
>
>> >>>
>> >>>
>> >>> On Wed, May 12, 2021 at 6:20 PM kushagra deep <
>> kushagra94deep@gmail.com> wrote:
>> >>>>
>> >>>> Hi All,
>> >>>>
>> >>>> I have two dataframes
>> >>>>
>> >>>> df1
>> >>>>
>> >>>> amount_6m
>> >>>>  100
>> >>>>  200
>> >>>>  300
>> >>>>  400
>> >>>>  500
>> >>>>
>> >>>> And a second data df2 below
>> >>>>
>> >>>>  amount_9m
>> >>>>   500
>> >>>>   600
>> >>>>   700
>> >>>>   800
>> >>>>   900
>> >>>>
>> >>>> The number of rows is same in both dataframes.
>> >>>>
>> >>>> Can I merge the two dataframes to achieve below df
>> >>>>
>> >>>> df3
>> >>>>
>> >>>> amount_6m | amount_9m
>> >>>>     100                   500
>> >>>>      200                  600
>> >>>>      300                  700
>> >>>>      400                  800
>> >>>>      500                  900
>> >>>>
>> >>>> Thanks in advance
>> >>>>
>> >>>> Reg,
>> >>>> Kushagra Deep
>> >>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
--
Best Regards,
Ayan Guha

Re: Merge two dataframes

Posted by Mich Talebzadeh <mi...@gmail.com>.
The generation of row_number() has to be performed through a window call, and
I don't think there is any way around it without orderBy():

df1 = df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"), "amount_6m")

The problem is that without a partitionBy() clause the data will be skewed
towards one executor.

WARN window.WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation.
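
One workaround sometimes used to avoid the single-partition window is
RDD.zipWithIndex(), which numbers rows in their current partition order
without a global sort. A rough sketch, not from this thread, assuming the
spark session and the df1/df2 DataFrames from the earlier snippet (and, like
any positional merge, assuming the current row order is the one you want):

from pyspark.sql.types import StructType, StructField, LongType

def with_row_num(df, col_name="row_num"):
    # zipWithIndex() assigns consecutive indices across partitions; it runs a
    # small job to compute per-partition offsets instead of collecting all
    # rows into one partition
    new_schema = StructType(df.schema.fields + [StructField(col_name, LongType(), False)])
    numbered = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    return spark.createDataFrame(numbered, new_schema)

with_row_num(df1).join(with_row_num(df2), "row_num", "inner") \
    .select("amount_6m", "amount_9m").show()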

Cheers


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 12 May 2021 at 17:33, Andrew Melo <an...@gmail.com> wrote:

> Hi,
>
> In the case where the left and right hand side share a common parent like:
>
> df = spark.read.someDataframe().withColumn('rownum', row_number())
> df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
> df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
> df_joined = df1.join(df2, 'rownum', 'inner')
>
> (or maybe replacing row_number() with monotonically_increasing_id()....)
>
> Is there some hint/optimization that can be done to let Spark know
> that the left and right hand-sides of the join share the same
> ordering, and a sort/hash merge doesn't need to be done?
>
> Thanks
> Andrew
>
> On Wed, May 12, 2021 at 11:07 AM Sean Owen <sr...@gmail.com> wrote:
> >
> > Yeah I don't think that's going to work - you aren't guaranteed to get
> 1, 2, 3, etc. I think row_number() might be what you need to generate a
> join ID.
> >
> > RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not.
> You could .zip two RDDs you get from DataFrames and manually convert the
> Rows back to a single Row and back to DataFrame.
> >
> >
> > On Wed, May 12, 2021 at 10:47 AM kushagra deep <ku...@gmail.com>
> wrote:
> >>
> >> Thanks Raghvendra
> >>
> >> Will the ids for corresponding columns  be same always ? Since
> monotonic_increasing_id() returns a number based on partitionId and the row
> number of the partition  ,will it be same for corresponding columns? Also
> is it guaranteed that the two dataframes will be divided into logical spark
> partitions with the same cardinality for each partition ?
> >>
> >> Reg,
> >> Kushagra Deep
> >>
> >> On Wed, May 12, 2021, 21:00 Raghavendra Ganesh <ra...@gmail.com>
> wrote:
> >>>
> >>> You can add an extra id column and perform an inner join.
> >>>
> >>> val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
> >>>
> >>> val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
> >>>
> >>> df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
> >>>
> >>> +---------+---------+
> >>>
> >>> |amount_6m|amount_9m|
> >>>
> >>> +---------+---------+
> >>>
> >>> |      100|      500|
> >>>
> >>> |      200|      600|
> >>>
> >>> |      300|      700|
> >>>
> >>> |      400|      800|
> >>>
> >>> |      500|      900|
> >>>
> >>> +---------+---------+
> >>>
> >>>
> >>> --
> >>> Raghavendra
> >>>
> >>>
> >>> On Wed, May 12, 2021 at 6:20 PM kushagra deep <
> kushagra94deep@gmail.com> wrote:
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I have two dataframes
> >>>>
> >>>> df1
> >>>>
> >>>> amount_6m
> >>>>  100
> >>>>  200
> >>>>  300
> >>>>  400
> >>>>  500
> >>>>
> >>>> And a second data df2 below
> >>>>
> >>>>  amount_9m
> >>>>   500
> >>>>   600
> >>>>   700
> >>>>   800
> >>>>   900
> >>>>
> >>>> The number of rows is same in both dataframes.
> >>>>
> >>>> Can I merge the two dataframes to achieve below df
> >>>>
> >>>> df3
> >>>>
> >>>> amount_6m | amount_9m
> >>>>     100                   500
> >>>>      200                  600
> >>>>      300                  700
> >>>>      400                  800
> >>>>      500                  900
> >>>>
> >>>> Thanks in advance
> >>>>
> >>>> Reg,
> >>>> Kushagra Deep
> >>>>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

Re: Merge two dataframes

Posted by Andrew Melo <an...@gmail.com>.
On Mon, May 17, 2021 at 2:31 PM Lalwani, Jayesh <jl...@amazon.com> wrote:
>
> If the UDFs are computationally expensive, I wouldn't solve this problem with  UDFs at all. If they are working in an iterative manner, and assuming each iteration is independent of other iterations (yes, I know that's a big assumptiuon), I would think about exploding your dataframe to have a row per iteration, and working on each row separately, and then aggregating in the end. This allows you to scale your computation much better.

Ah, in this case I mean "iterative" in the "code/run/examine" sense of the
word, not that the UDF itself is performing an iterative computation.

>
> I know not all computations can be map-reducable like that. However, most can.
>
> Split and merge data workflows in Spark don't work like their DAG representations, unless you add costly caches. Without caching, each split will result in Spark rereading data from the source, even if the splits are getting merged together. The only way to avoid it is by caching at the split point, which depending on the amount of data can become costly. Also, joins result in shuffles. Avoiding splits and merges is better.
>
> To give you an example, we had an application that applied a series of rules to rows. The output required was a dataframe with an additional column that indicated which rule the row satisfied. In our initial implementation, we had a series of r one per rule. For N rules, we created N dataframes that had the rows that satisfied the rules. The we unioned the N data frames. Horrible performance that didn't scale with N. We reimplemented to add N Boolean columns; one per rule; that indicated if the rule was satisfied. We just kept adding the boolen columns to the dataframe. After iterating over the rules, we added another column that indicated out which rule was satisfied, and then dropped the Boolean columns. Much better performance that scaled with N. Spark read from datasource just once, and since there were no joins/unions, there was no shuffle

The hitch in your example, and what we're trying to avoid, is that if
you need to change one of these boolean columns, you end up needing to
recompute everything "afterwards" in the DAG (AFAICT), even if the
"latter" stages don't have a true dependency on the changed column. We
do explorations of very large physics datasets, and one of the
disadvantages of our bespoke analysis software is that any change to
the analysis code involves re-computing everything from scratch. A big
goal of mine is to make it so that what was changed is recomputed, and
no more, which will speed up the rate at which we can find new
physics.

Cheers
Andrew

>
> On 5/17/21, 2:56 PM, "Andrew Melo" <an...@gmail.com> wrote:
>
>     CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
>     In our case, these UDFs are quite expensive and worked on in an
>     iterative manner, so being able to cache the two "sides" of the graphs
>     independently will speed up the development cycle. Otherwise, if you
>     modify foo() here, then you have to recompute bar and baz, even though
>     they're unchanged.
>
>     df.withColumn('a', foo('x')).withColumn('b', bar('x')).withColumn('c', baz('x'))
>
>     Additionally, a longer goal would be to be able to persist/cache these
>     columns to disk so a downstream user could later mix and match several
>     (10s) of these columns together as their inputs w/o having to
>     explicitly compute them themselves.
>
>     Cheers
>     Andrew
>
>     On Mon, May 17, 2021 at 1:10 PM Sean Owen <sr...@gmail.com> wrote:
>     >
>     > Why join here - just add two columns to the DataFrame directly?
>     >
>     > On Mon, May 17, 2021 at 1:04 PM Andrew Melo <an...@gmail.com> wrote:
>     >>
>     >> Anyone have ideas about the below Q?
>     >>
>     >> It seems to me that given that "diamond" DAG, that spark could see
>     >> that the rows haven't been shuffled/filtered, it could do some type of
>     >> "zip join" to push them together, but I've not been able to get a plan
>     >> that doesn't do a hash/sort merge join
>     >>
>     >> Cheers
>     >> Andrew
>     >>
>
>     ---------------------------------------------------------------------
>     To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Merge two dataframes

Posted by "Lalwani, Jayesh" <jl...@amazon.com.INVALID>.
If the UDFs are computationally expensive, I wouldn't solve this problem with UDFs at all. If they are working in an iterative manner, and assuming each iteration is independent of other iterations (yes, I know that's a big assumption), I would think about exploding your dataframe to have a row per iteration, working on each row separately, and then aggregating in the end. This allows you to scale your computation much better.

I know not all computations can be map-reducible like that. However, most can.

Split-and-merge data workflows in Spark don't work like their DAG representations unless you add costly caches. Without caching, each split will result in Spark rereading data from the source, even if the splits are getting merged together. The only way to avoid it is by caching at the split point, which, depending on the amount of data, can become costly. Also, joins result in shuffles. Avoiding splits and merges is better.

To give you an example, we had an application that applied a series of rules to rows. The output required was a dataframe with an additional column that indicated which rule the row satisfied. In our initial implementation, we had a series of dataframes, one per rule: for N rules, we created N dataframes holding the rows that satisfied each rule, and then we unioned the N data frames. Horrible performance that didn't scale with N. We reimplemented it to add N Boolean columns, one per rule, indicating whether the rule was satisfied. We just kept adding the Boolean columns to the dataframe. After iterating over the rules, we added another column that indicated which rule was satisfied, and then dropped the Boolean columns. Much better performance that scaled with N. Spark read from the datasource just once, and since there were no joins/unions, there was no shuffle.
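
A rough sketch of that Boolean-column pattern (the rule names, predicates and
the df variable here are invented for illustration; this is not the original
application's code):

import pyspark.sql.functions as F

# One Boolean flag column per rule (hypothetical predicates)
rules = {
    "rule_a": F.col("amount") > 100,
    "rule_b": F.col("country") == "US",
}
df_flagged = df
for name, predicate in rules.items():
    df_flagged = df_flagged.withColumn(name, predicate)

# Collapse the flags into one column naming the first satisfied rule,
# then drop the per-rule Boolean columns
matched = F.coalesce(*[F.when(F.col(name), F.lit(name)) for name in rules])
result = df_flagged.withColumn("matched_rule", matched).drop(*rules.keys())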

On 5/17/21, 2:56 PM, "Andrew Melo" <an...@gmail.com> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    In our case, these UDFs are quite expensive and worked on in an
    iterative manner, so being able to cache the two "sides" of the graphs
    independently will speed up the development cycle. Otherwise, if you
    modify foo() here, then you have to recompute bar and baz, even though
    they're unchanged.

    df.withColumn('a', foo('x')).withColumn('b', bar('x')).withColumn('c', baz('x'))

    Additionally, a longer goal would be to be able to persist/cache these
    columns to disk so a downstream user could later mix and match several
    (10s) of these columns together as their inputs w/o having to
    explicitly compute them themselves.

    Cheers
    Andrew

    On Mon, May 17, 2021 at 1:10 PM Sean Owen <sr...@gmail.com> wrote:
    >
    > Why join here - just add two columns to the DataFrame directly?
    >
    > On Mon, May 17, 2021 at 1:04 PM Andrew Melo <an...@gmail.com> wrote:
    >>
    >> Anyone have ideas about the below Q?
    >>
    >> It seems to me that given that "diamond" DAG, that spark could see
    >> that the rows haven't been shuffled/filtered, it could do some type of
    >> "zip join" to push them together, but I've not been able to get a plan
    >> that doesn't do a hash/sort merge join
    >>
    >> Cheers
    >> Andrew
    >>

    ---------------------------------------------------------------------
    To unsubscribe e-mail: user-unsubscribe@spark.apache.org



Re: Merge two dataframes

Posted by Andrew Melo <an...@gmail.com>.
In our case, these UDFs are quite expensive and worked on in an
iterative manner, so being able to cache the two "sides" of the graphs
independently will speed up the development cycle. Otherwise, if you
modify foo() here, then you have to recompute bar and baz, even though
they're unchanged.

df.withColumn('a', foo('x')).withColumn('b', bar('x')).withColumn('c', baz('x'))

Additionally, a longer goal would be to be able to persist/cache these
columns to disk so a downstream user could later mix and match several
(10s) of these columns together as their inputs w/o having to
explicitly compute them themselves.
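
A minimal sketch of that idea, assuming df already carries the rownum column
from the earlier snippet and foo/bar are the hypothetical expensive UDFs (this
only sketches the caching, not what the optimizer will do with the join):

from pyspark.storagelevel import StorageLevel

# Materialize each expensive column as its own DataFrame keyed by rownum,
# persisted to disk so it can be reused across development iterations
side_a = df.select("rownum", foo("x").alias("a")).persist(StorageLevel.DISK_ONLY)
side_b = df.select("rownum", bar("x").alias("b")).persist(StorageLevel.DISK_ONLY)

combined = side_a.join(side_b, "rownum", "inner")
# Redefining foo() only invalidates side_a; side_b can still be served from
# its persisted copy instead of being recomputed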

Cheers
Andrew

On Mon, May 17, 2021 at 1:10 PM Sean Owen <sr...@gmail.com> wrote:
>
> Why join here - just add two columns to the DataFrame directly?
>
> On Mon, May 17, 2021 at 1:04 PM Andrew Melo <an...@gmail.com> wrote:
>>
>> Anyone have ideas about the below Q?
>>
>> It seems to me that given that "diamond" DAG, that spark could see
>> that the rows haven't been shuffled/filtered, it could do some type of
>> "zip join" to push them together, but I've not been able to get a plan
>> that doesn't do a hash/sort merge join
>>
>> Cheers
>> Andrew
>>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Merge two dataframes

Posted by Sean Owen <sr...@gmail.com>.
Why join here - just add two columns to the DataFrame directly?

On Mon, May 17, 2021 at 1:04 PM Andrew Melo <an...@gmail.com> wrote:

> Anyone have ideas about the below Q?
>
> It seems to me that given that "diamond" DAG, that spark could see
> that the rows haven't been shuffled/filtered, it could do some type of
> "zip join" to push them together, but I've not been able to get a plan
> that doesn't do a hash/sort merge join
>
> Cheers
> Andrew
>
>

Re: Merge two dataframes

Posted by Andrew Melo <an...@gmail.com>.
Anyone have ideas about the below Q?

It seems to me that, given that "diamond" DAG, Spark could see that the rows
haven't been shuffled/filtered and could do some type of "zip join" to push
them together, but I've not been able to get a plan that doesn't do a
hash/sort merge join.

Cheers
Andrew

On Wed, May 12, 2021 at 11:32 AM Andrew Melo <an...@gmail.com> wrote:
>
> Hi,
>
> In the case where the left and right hand side share a common parent like:
>
> df = spark.read.someDataframe().withColumn('rownum', row_number())
> df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
> df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
> df_joined = df1.join(df2, 'rownum', 'inner')
>
> (or maybe replacing row_number() with monotonically_increasing_id()....)
>
> Is there some hint/optimization that can be done to let Spark know
> that the left and right hand-sides of the join share the same
> ordering, and a sort/hash merge doesn't need to be done?
>
> Thanks
> Andrew
>
> On Wed, May 12, 2021 at 11:07 AM Sean Owen <sr...@gmail.com> wrote:
> >
> > Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID.
> >
> > RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You could .zip two RDDs you get from DataFrames and manually convert the Rows back to a single Row and back to DataFrame.
> >
> >
> > On Wed, May 12, 2021 at 10:47 AM kushagra deep <ku...@gmail.com> wrote:
> >>
> >> Thanks Raghvendra
> >>
> >> Will the ids for corresponding columns  be same always ? Since monotonic_increasing_id() returns a number based on partitionId and the row number of the partition  ,will it be same for corresponding columns? Also is it guaranteed that the two dataframes will be divided into logical spark partitions with the same cardinality for each partition ?
> >>
> >> Reg,
> >> Kushagra Deep
> >>
> >> On Wed, May 12, 2021, 21:00 Raghavendra Ganesh <ra...@gmail.com> wrote:
> >>>
> >>> You can add an extra id column and perform an inner join.
> >>>
> >>> val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
> >>>
> >>> val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
> >>>
> >>> df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
> >>>
> >>> +---------+---------+
> >>>
> >>> |amount_6m|amount_9m|
> >>>
> >>> +---------+---------+
> >>>
> >>> |      100|      500|
> >>>
> >>> |      200|      600|
> >>>
> >>> |      300|      700|
> >>>
> >>> |      400|      800|
> >>>
> >>> |      500|      900|
> >>>
> >>> +---------+---------+
> >>>
> >>>
> >>> --
> >>> Raghavendra
> >>>
> >>>
> >>> On Wed, May 12, 2021 at 6:20 PM kushagra deep <ku...@gmail.com> wrote:
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I have two dataframes
> >>>>
> >>>> df1
> >>>>
> >>>> amount_6m
> >>>>  100
> >>>>  200
> >>>>  300
> >>>>  400
> >>>>  500
> >>>>
> >>>> And a second data df2 below
> >>>>
> >>>>  amount_9m
> >>>>   500
> >>>>   600
> >>>>   700
> >>>>   800
> >>>>   900
> >>>>
> >>>> The number of rows is same in both dataframes.
> >>>>
> >>>> Can I merge the two dataframes to achieve below df
> >>>>
> >>>> df3
> >>>>
> >>>> amount_6m | amount_9m
> >>>>     100                   500
> >>>>      200                  600
> >>>>      300                  700
> >>>>      400                  800
> >>>>      500                  900
> >>>>
> >>>> Thanks in advance
> >>>>
> >>>> Reg,
> >>>> Kushagra Deep
> >>>>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Merge two dataframes

Posted by Andrew Melo <an...@gmail.com>.
Hi,

In the case where the left and right hand side share a common parent like:

df = spark.read.someDataframe().withColumn('rownum', row_number())
df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
df_joined = df1.join(df2, 'rownum', 'inner')

(or maybe replacing row_number() with monotonically_increasing_id()....)

Is there some hint/optimization that can be done to let Spark know
that the left and right hand-sides of the join share the same
ordering, and a sort/hash merge doesn't need to be done?

Thanks
Andrew

On Wed, May 12, 2021 at 11:07 AM Sean Owen <sr...@gmail.com> wrote:
>
> Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID.
>
> RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You could .zip two RDDs you get from DataFrames and manually convert the Rows back to a single Row and back to DataFrame.
>
>
> On Wed, May 12, 2021 at 10:47 AM kushagra deep <ku...@gmail.com> wrote:
>>
>> Thanks Raghvendra
>>
>> Will the ids for corresponding columns  be same always ? Since monotonic_increasing_id() returns a number based on partitionId and the row number of the partition  ,will it be same for corresponding columns? Also is it guaranteed that the two dataframes will be divided into logical spark partitions with the same cardinality for each partition ?
>>
>> Reg,
>> Kushagra Deep
>>
>> On Wed, May 12, 2021, 21:00 Raghavendra Ganesh <ra...@gmail.com> wrote:
>>>
>>> You can add an extra id column and perform an inner join.
>>>
>>> val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
>>>
>>> val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
>>>
>>> df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
>>>
>>> +---------+---------+
>>>
>>> |amount_6m|amount_9m|
>>>
>>> +---------+---------+
>>>
>>> |      100|      500|
>>>
>>> |      200|      600|
>>>
>>> |      300|      700|
>>>
>>> |      400|      800|
>>>
>>> |      500|      900|
>>>
>>> +---------+---------+
>>>
>>>
>>> --
>>> Raghavendra
>>>
>>>
>>> On Wed, May 12, 2021 at 6:20 PM kushagra deep <ku...@gmail.com> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I have two dataframes
>>>>
>>>> df1
>>>>
>>>> amount_6m
>>>>  100
>>>>  200
>>>>  300
>>>>  400
>>>>  500
>>>>
>>>> And a second data df2 below
>>>>
>>>>  amount_9m
>>>>   500
>>>>   600
>>>>   700
>>>>   800
>>>>   900
>>>>
>>>> The number of rows is same in both dataframes.
>>>>
>>>> Can I merge the two dataframes to achieve below df
>>>>
>>>> df3
>>>>
>>>> amount_6m | amount_9m
>>>>     100                   500
>>>>      200                  600
>>>>      300                  700
>>>>      400                  800
>>>>      500                  900
>>>>
>>>> Thanks in advance
>>>>
>>>> Reg,
>>>> Kushagra Deep
>>>>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Merge two dataframes

Posted by Sean Owen <sr...@gmail.com>.
Yeah I don't think that's going to work - you aren't guaranteed to get 1,
2, 3, etc. I think row_number() might be what you need to generate a join
ID.
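
A quick way to see why those ids may not line up, sketched here for
illustration (not from the thread): monotonically_increasing_id() puts the
partition number in the upper bits of the id, so two DataFrames only produce
matching ids if their partition layouts happen to be identical.

import pyspark.sql.functions as F

# Compare the partition layouts of the two DataFrames
print(df1.rdd.getNumPartitions(), df2.rdd.getNumPartitions())
print(df1.rdd.glom().map(len).collect())   # rows per partition in df1
print(df2.rdd.glom().map(len).collect())   # rows per partition in df2

df1.withColumn("id", F.monotonically_increasing_id()).show()
df2.withColumn("id", F.monotonically_increasing_id()).show()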

RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You
could .zip two RDDs you get from DataFrames and manually convert the Rows
back to a single Row and back to DataFrame.
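
A sketch of that .zip route (assuming an active spark session; note that
RDD.zip requires both RDDs to have the same number of partitions and the same
number of elements per partition, which is the fragile part):

# Pair rows positionally, then rebuild a DataFrame from the combined tuples
zipped = df1.rdd.zip(df2.rdd).map(lambda pair: tuple(pair[0]) + tuple(pair[1]))
df3 = spark.createDataFrame(zipped, schema=df1.columns + df2.columns)
df3.show()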


On Wed, May 12, 2021 at 10:47 AM kushagra deep <ku...@gmail.com>
wrote:

> Thanks Raghvendra
>
> Will the ids for corresponding columns  be same always ? Since
> monotonic_increasing_id() returns a number based on partitionId and the row
> number of the partition  ,will it be same for corresponding columns? Also
> is it guaranteed that the two dataframes will be divided into logical spark
> partitions with the same cardinality for each partition ?
>
> Reg,
> Kushagra Deep
>
> On Wed, May 12, 2021, 21:00 Raghavendra Ganesh <ra...@gmail.com>
> wrote:
>
>> You can add an extra id column and perform an inner join.
>>
>> val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
>>
>> val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
>>
>> df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
>>
>> +---------+---------+
>>
>> |amount_6m|amount_9m|
>>
>> +---------+---------+
>>
>> |      100|      500|
>>
>> |      200|      600|
>>
>> |      300|      700|
>>
>> |      400|      800|
>>
>> |      500|      900|
>>
>> +---------+---------+
>>
>>
>> --
>> Raghavendra
>>
>>
>> On Wed, May 12, 2021 at 6:20 PM kushagra deep <ku...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I have two dataframes
>>>
>>> df1
>>>
>>> amount_6m
>>>  100
>>>  200
>>>  300
>>>  400
>>>  500
>>>
>>> And a second data df2 below
>>>
>>>  amount_9m
>>>   500
>>>   600
>>>   700
>>>   800
>>>   900
>>>
>>> The number of rows is same in both dataframes.
>>>
>>> Can I merge the two dataframes to achieve below df
>>>
>>> df3
>>>
>>> amount_6m | amount_9m
>>>     100                   500
>>>      200                  600
>>>      300                  700
>>>      400                  800
>>>      500                  900
>>>
>>> Thanks in advance
>>>
>>> Reg,
>>> Kushagra Deep
>>>
>>>

Re: Merge two dataframes

Posted by kushagra deep <ku...@gmail.com>.
Thanks Raghavendra

Will the ids for corresponding columns be the same always? Since
monotonically_increasing_id() returns a number based on the partitionId and
the row number within the partition, will it be the same for corresponding
columns? Also, is it guaranteed that the two dataframes will be divided into
logical Spark partitions with the same cardinality for each partition?

Reg,
Kushagra Deep

On Wed, May 12, 2021, 21:00 Raghavendra Ganesh <ra...@gmail.com>
wrote:

> You can add an extra id column and perform an inner join.
>
> val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
>
> val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
>
> df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
>
> +---------+---------+
>
> |amount_6m|amount_9m|
>
> +---------+---------+
>
> |      100|      500|
>
> |      200|      600|
>
> |      300|      700|
>
> |      400|      800|
>
> |      500|      900|
>
> +---------+---------+
>
>
> --
> Raghavendra
>
>
> On Wed, May 12, 2021 at 6:20 PM kushagra deep <ku...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have two dataframes
>>
>> df1
>>
>> amount_6m
>>  100
>>  200
>>  300
>>  400
>>  500
>>
>> And a second data df2 below
>>
>>  amount_9m
>>   500
>>   600
>>   700
>>   800
>>   900
>>
>> The number of rows is same in both dataframes.
>>
>> Can I merge the two dataframes to achieve below df
>>
>> df3
>>
>> amount_6m | amount_9m
>>     100                   500
>>      200                  600
>>      300                  700
>>      400                  800
>>      500                  900
>>
>> Thanks in advance
>>
>> Reg,
>> Kushagra Deep
>>
>>

Re: Merge two dataframes

Posted by Raghavendra Ganesh <ra...@gmail.com>.
You can add an extra id column and perform an inner join.

import org.apache.spark.sql.functions.monotonically_increasing_id

val df1_with_id = df1.withColumn("id", monotonically_increasing_id())

val df2_with_id = df2.withColumn("id", monotonically_increasing_id())

df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()

+---------+---------+
|amount_6m|amount_9m|
+---------+---------+
|      100|      500|
|      200|      600|
|      300|      700|
|      400|      800|
|      500|      900|
+---------+---------+
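
For completeness, a rough PySpark equivalent of the Scala above (a sketch
using the same df1/df2 from the question):

from pyspark.sql.functions import monotonically_increasing_id

df1_with_id = df1.withColumn("id", monotonically_increasing_id())
df2_with_id = df2.withColumn("id", monotonically_increasing_id())
df1_with_id.join(df2_with_id, ["id"], "inner").drop("id").show()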


--
Raghavendra


On Wed, May 12, 2021 at 6:20 PM kushagra deep <ku...@gmail.com>
wrote:

> Hi All,
>
> I have two dataframes
>
> df1
>
> amount_6m
>  100
>  200
>  300
>  400
>  500
>
> And a second data df2 below
>
>  amount_9m
>   500
>   600
>   700
>   800
>   900
>
> The number of rows is same in both dataframes.
>
> Can I merge the two dataframes to achieve below df
>
> df3
>
> amount_6m | amount_9m
>     100                   500
>      200                  600
>      300                  700
>      400                  800
>      500                  900
>
> Thanks in advance
>
> Reg,
> Kushagra Deep
>
>