You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2018/03/31 00:54:13 UTC

how to create all possible combinations from an array? how to join and explode row array?

I have a dataframe and execute  df.groupBy(³xyzy²).agg( collect_list(³abc²)

This produces a column of type array. Now for each row I want to create a
multiple pairs/tuples from the array so that I can create a contingency
table.  Any idea how I can transform my data so that call crosstab() ? The
join transformation operate on the entire dataframe. I need something at the
row array level?


Bellow is some sample python and describes what I would like my results to
be?

Kind regards

Andy


c1 = ["john", "bill", "sam"]
c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
p = pd.DataFrame({"a":c1, "b":c2})

df = sqlContext.createDataFrame(p)
df.printSchema()
df.show()

root
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----------------+
|   a|               b|
+----+----------------+
|john|[red, blue, red]|
|bill   |     [blue, red]|
| sam|         [green]|
+----+----------------+


The output I am trying to create is. I could live with a crossJoin
(cartesian join) and add my own filtering if it makes the problem easier?


+----+----------------+
|  x1|    x2|
+----+----------------+
red  | blue
red  | red
blue | red
+----+----------------+





Re: how to create all possible combinations from an array? how to join and explode row array?

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.
NICE!

Thanks Brandon


Andy.

From:  Brandon Geise <br...@gmail.com>
Date:  Friday, March 30, 2018 at 6:15 PM
To:  Andrew Davidson <An...@SantaCruzIntegration.com>, "user @spark"
<us...@spark.apache.org>
Subject:  Re: how to create all possible combinations from an array? how to
join and explode row array?

> Possibly instead of doing the initial grouping, just do a full outer join on
> zyzy.  This is in scala but should be easily convertible to python.
>  
> val data = Array(("john", "red"), ("john", "blue"), ("john", "red"), ("bill",
> "blue"), ("bill", "red"), ("sam", "green"))
>     val distData: DataFrame = spark.sparkContext.parallelize(data).toDF("a",
> "b")
>     distData.show()
> +----+-----+
> | a| b|
> +----+-----+
> |john| red|
> |john| blue|
> |john| red|
> |bill| blue|
> |bill| red|
> | sam|green|
> +----+-----+
>  
>  
> distData.as("tbl1").join(distData.as("tbl2"), Seq("a"),
> "fullouter").select("tbl1.b", "tbl2.b").distinct.show()
>  
> +-----+-----+
> | b| b|
> +-----+-----+
> | blue| red|
> | red| blue|
> | red| red|
> | blue| blue|
> |green|green|
> +-----+-----+
>  
>  
> 
> From: Andy Davidson <An...@SantaCruzIntegration.com>
> Date: Friday, March 30, 2018 at 8:58 PM
> To: Andy Davidson <An...@SantaCruzIntegration.com>, user
> <us...@spark.apache.org>
> Subject: Re: how to create all possible combinations from an array? how to
> join and explode row array?
> 
>  
> 
> I was a little sloppy when I created the sample output. Its missing a few
> pairs
> 
>  
> 
> Assume for a given row I have [a, b, c] I want to create something like the
> cartesian join
> 
>  
> 
> From: Andrew Davidson <Andy@SantaCruzIntegration.com
> <ma...@SantaCruzIntegration.com> >
> Date: Friday, March 30, 2018 at 5:54 PM
> To: "user @spark" <user@spark.apache.org <ma...@spark.apache.org> >
> Subject: how to create all possible combinations from an array? how to join
> and explode row array?
> 
>  
>> 
>> I have a dataframe and execute  df.groupBy(³xyzy²).agg( collect_list(³abc²)
>> 
>>  
>> 
>> This produces a column of type array. Now for each row I want to create a
>> multiple pairs/tuples from the array so that I can create a contingency
>> table.  Any idea how I can transform my data so that call crosstab() ? The
>> join transformation operate on the entire dataframe. I need something at the
>> row array level?
>> 
>> 
>> 
>> 
>> 
>> Bellow is some sample python and describes what I would like my results to
>> be?
>> 
>> 
>> 
>> Kind regards
>> 
>> 
>> 
>> Andy
>> 
>>  
>> 
>>  
>> 
>> c1 = ["john", "bill", "sam"]
>> 
>> c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
>> 
>> p = pd.DataFrame({"a":c1, "b":c2})
>> 
>>  
>> 
>> df = sqlContext.createDataFrame(p)
>> 
>> df.printSchema()
>> 
>> df.show()
>> 
>>  
>> 
>> root
>> 
>>  |-- a: string (nullable = true)
>> 
>>  |-- b: array (nullable = true)
>> 
>>  |    |-- element: string (containsNull = true)
>> 
>>  
>> 
>> +----+----------------+
>> 
>> |   a|               b|
>> 
>> +----+----------------+
>> 
>> |john|[red, blue, red]|
>> 
>> |bill   |     [blue, red]|
>> 
>> | sam|         [green]|
>> 
>> +----+----------------+
>> 
>>  
>> 
>>  
>> 
>> The output I am trying to create is. I could live with a crossJoin (cartesian
>> join) and add my own filtering if it makes the problem easier?
>> 
>>  
>> 
>>  
>> 
>> +----+----------------+
>> 
>> |  x1|    x2|
>> 
>> +----+----------------+
>> 
>> red  | blue
>> 
>> red  | red
>> 
>> blue | red
>> 
>> +----+----------------+
>> 
>>  
>> 
>>  



Re: how to create all possible combinations from an array? how to join and explode row array?

Posted by Brandon Geise <br...@gmail.com>.
Possibly instead of doing the initial grouping, just do a full outer join on zyzy.  This is in scala but should be easily convertible to python.

 

val data = Array(("john", "red"), ("john", "blue"), ("john", "red"), ("bill", "blue"), ("bill", "red"), ("sam", "green"))

    val distData: DataFrame = spark.sparkContext.parallelize(data).toDF("a", "b")

    distData.show()

+----+-----+

| a| b|

+----+-----+

|john| red|

|john| blue|

|john| red|

|bill| blue|

|bill| red|

| sam|green|

+----+-----+

 

 

distData.as("tbl1").join(distData.as("tbl2"), Seq("a"), "fullouter").select("tbl1.b", "tbl2.b").distinct.show()

 

+-----+-----+

| b| b|

+-----+-----+

| blue| red|

| red| blue|

| red| red|

| blue| blue|

|green|green|

+-----+-----+

 

 

From: Andy Davidson <An...@SantaCruzIntegration.com>
Date: Friday, March 30, 2018 at 8:58 PM
To: Andy Davidson <An...@SantaCruzIntegration.com>, user <us...@spark.apache.org>
Subject: Re: how to create all possible combinations from an array? how to join and explode row array?

 

I was a little sloppy when I created the sample output. Its missing a few pairs

 

Assume for a given row I have [a, b, c] I want to create something like the cartesian join

 

From: Andrew Davidson <An...@SantaCruzIntegration.com>
Date: Friday, March 30, 2018 at 5:54 PM
To: "user @spark" <us...@spark.apache.org>
Subject: how to create all possible combinations from an array? how to join and explode row array?

 

I have a dataframe and execute  df.groupBy(“xyzy”).agg( collect_list(“abc”)

 

This produces a column of type array. Now for each row I want to create a multiple pairs/tuples from the array so that I can create a contingency table.  Any idea how I can transform my data so that call crosstab() ? The join transformation operate on the entire dataframe. I need something at the row array level?





Bellow is some sample python and describes what I would like my results to be?



Kind regards



Andy

 

 

c1 = ["john", "bill", "sam"]

c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]

p = pd.DataFrame({"a":c1, "b":c2})

 

df = sqlContext.createDataFrame(p)

df.printSchema()

df.show()

 

root

 |-- a: string (nullable = true)

 |-- b: array (nullable = true)

 |    |-- element: string (containsNull = true)

 

+----+----------------+

|   a|               b|

+----+----------------+

|john|[red, blue, red]|

|bill   |     [blue, red]|

| sam|         [green]|

+----+----------------+

 

 

The output I am trying to create is. I could live with a crossJoin (cartesian join) and add my own filtering if it makes the problem easier?

 

 

+----+----------------+

|  x1|    x2|

+----+----------------+

red  | blue

red  | red

blue | red

+----+----------------+

 

 


Re: how to create all possible combinations from an array? how to join and explode row array?

Posted by Yong Zhang <ja...@hotmail.com>.
What's wrong just using a UDF doing for loop in scala? You can change the for loop logic for what combination you want.


scala> spark.version
res4: String = 2.2.1

scala> aggDS.printSchema
root
 |-- name: string (nullable = true)
 |-- colors: array (nullable = true)
 |    |-- element: string (containsNull = true)


scala> aggDS.show(false)
+----+----------------+
|name|colors          |
+----+----------------+
|john|[red, blue, red]|
|bill|[blue, red]     |
|sam |[gree]          |
+----+----------------+

scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf

scala> val loopUDF = udf { x: Seq[String] => for (a <- x; b <-x) yield (a,b) }
loopUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(ArrayType(StringType,true))))

scala> aggDS.withColumn("newCol", loopUDF($"colors")).show(false)
+----+----------------+---------------------------------------------------------------------------------------------------------+
|name|colors          |newCol                                                                                                   |
+----+----------------+---------------------------------------------------------------------------------------------------------+
|john|[red, blue, red]|[[red,red], [red,blue], [red,red], [blue,red], [blue,blue], [blue,red], [red,red], [red,blue], [red,red]]|
|bill|[blue, red]     |[[blue,blue], [blue,red], [red,blue], [red,red]]                                                         |
|sam |[gree]          |[[gree,gree]]                                                                                            |
+----+----------------+-----------------------------------------------------------------

Yong


________________________________
From: Andy Davidson <An...@SantaCruzIntegration.com>
Sent: Friday, March 30, 2018 8:58 PM
To: Andy Davidson; user
Subject: Re: how to create all possible combinations from an array? how to join and explode row array?

I was a little sloppy when I created the sample output. Its missing a few pairs

Assume for a given row I have [a, b, c] I want to create something like the cartesian join

From: Andrew Davidson <An...@SantaCruzIntegration.com>>
Date: Friday, March 30, 2018 at 5:54 PM
To: "user @spark" <us...@spark.apache.org>>
Subject: how to create all possible combinations from an array? how to join and explode row array?

I have a dataframe and execute  df.groupBy(“xyzy”).agg( collect_list(“abc”)

This produces a column of type array. Now for each row I want to create a multiple pairs/tuples from the array so that I can create a contingency table.  Any idea how I can transform my data so that call crosstab() ? The join transformation operate on the entire dataframe. I need something at the row array level?


Bellow is some sample python and describes what I would like my results to be?

Kind regards

Andy


c1 = ["john", "bill", "sam"]
c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
p = pd.DataFrame({"a":c1, "b":c2})

df = sqlContext.createDataFrame(p)
df.printSchema()
df.show()

root
 |-- a: string (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----+----------------+
|   a|               b|
+----+----------------+
|john|[red, blue, red]|
|bill   |     [blue, red]|
| sam|         [green]|
+----+----------------+


The output I am trying to create is. I could live with a crossJoin (cartesian join) and add my own filtering if it makes the problem easier?


+----+----------------+
|  x1|    x2|
+----+----------------+
red  | blue
red  | red
blue | red
+----+----------------+



Re: how to create all possible combinations from an array? how to join and explode row array?

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.
I was a little sloppy when I created the sample output. Its missing a few
pairs

Assume for a given row I have [a, b, c] I want to create something like the
cartesian join

From:  Andrew Davidson <An...@SantaCruzIntegration.com>
Date:  Friday, March 30, 2018 at 5:54 PM
To:  "user @spark" <us...@spark.apache.org>
Subject:  how to create all possible combinations from an array? how to join
and explode row array?

> I have a dataframe and execute  df.groupBy(³xyzy²).agg( collect_list(³abc²)
> 
> This produces a column of type array. Now for each row I want to create a
> multiple pairs/tuples from the array so that I can create a contingency table.
> Any idea how I can transform my data so that call crosstab() ? The join
> transformation operate on the entire dataframe. I need something at the row
> array level?
> 
> 
> Bellow is some sample python and describes what I would like my results to be?
> 
> Kind regards
> 
> Andy
> 
> 
> c1 = ["john", "bill", "sam"]
> c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
> p = pd.DataFrame({"a":c1, "b":c2})
> 
> df = sqlContext.createDataFrame(p)
> df.printSchema()
> df.show()
> 
> root
>  |-- a: string (nullable = true)
>  |-- b: array (nullable = true)
>  |    |-- element: string (containsNull = true)
> 
> +----+----------------+
> |   a|               b|
> +----+----------------+
> |john|[red, blue, red]|
> |bill   |     [blue, red]|
> | sam|         [green]|
> +----+----------------+
> 
> 
> The output I am trying to create is. I could live with a crossJoin (cartesian
> join) and add my own filtering if it makes the problem easier?
> 
> 
> +----+----------------+
> |  x1|    x2|
> +----+----------------+
> red  | blue
> red  | red
> blue | red
> +----+----------------+
> 
>