Posted to user@spark.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2018/03/31 00:54:13 UTC
how to create all possible combinations from an array? how to join
and explode row array?
I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc"))
This produces a column of type array. Now for each row I want to create
multiple pairs/tuples from the array so that I can create a contingency
table. Any idea how I can transform my data so that I can call crosstab()?
The join transformation operates on the entire dataframe; I need something
at the row/array level.
Below is some sample Python that describes what I would like my results to
be.
Kind regards
Andy
c1 = ["john", "bill", "sam"]
c2 = [['red', 'blue', 'red'], ['blue', 'red'], ['green']]
p = pd.DataFrame({"a":c1, "b":c2})
df = sqlContext.createDataFrame(p)
df.printSchema()
df.show()
root
|-- a: string (nullable = true)
|-- b: array (nullable = true)
| |-- element: string (containsNull = true)
+----+----------------+
| a| b|
+----+----------------+
|john|[red, blue, red]|
|bill|     [blue, red]|
| sam|         [green]|
+----+----------------+
The output I am trying to create is below. I could live with a crossJoin
(Cartesian join) and add my own filtering if that makes the problem easier.
+----+----+
|  x1|  x2|
+----+----+
| red|blue|
| red| red|
|blue| red|
+----+----+
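A plain-Python sketch of the transformation being asked for (illustrative only, not Spark code; the `arrays`, `pairs`, and `counts` names are mine): each row's array expands into all ordered pairs, and counting those pairs gives the contingency-table shape that crosstab() needs.

```python
from collections import Counter
from itertools import product

# Mirror the array column from the sample DataFrame above.
arrays = [["red", "blue", "red"], ["blue", "red"], ["green"]]

# For each row, take the Cartesian product of the array with itself,
# then tally how often each (x1, x2) pair occurs across all rows.
pairs = [pair for colors in arrays for pair in product(colors, repeat=2)]
counts = Counter(pairs)
```

With duplicate elements kept ("red" appears twice in the first row), the counts already carry contingency-table frequencies.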
Re: how to create all possible combinations from an array? how to
join and explode row array?
Posted by Andy Davidson <An...@SantaCruzIntegration.com>.
NICE!
Thanks Brandon
Andy.
From: Brandon Geise <br...@gmail.com>
Date: Friday, March 30, 2018 at 6:15 PM
To: Andrew Davidson <An...@SantaCruzIntegration.com>, "user @spark"
<us...@spark.apache.org>
Subject: Re: how to create all possible combinations from an array? how to
join and explode row array?
Re: how to create all possible combinations from an array? how to
join and explode row array?
Posted by Brandon Geise <br...@gmail.com>.
Possibly instead of doing the initial grouping, just do a full outer join on xyzy. This is in Scala but should be easily convertible to Python.
val data = Array(("john", "red"), ("john", "blue"), ("john", "red"), ("bill", "blue"), ("bill", "red"), ("sam", "green"))
val distData: DataFrame = spark.sparkContext.parallelize(data).toDF("a", "b")
distData.show()
+----+-----+
| a| b|
+----+-----+
|john| red|
|john| blue|
|john| red|
|bill| blue|
|bill| red|
| sam|green|
+----+-----+
distData.as("tbl1").join(distData.as("tbl2"), Seq("a"), "fullouter").select("tbl1.b", "tbl2.b").distinct.show()
+-----+-----+
| b| b|
+-----+-----+
| blue| red|
| red| blue|
| red| red|
| blue| blue|
|green|green|
+-----+-----+
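A plain-Python sketch of the self-join idea above (illustrative only, not Spark code; `data` and `joined` are my names): join the table to itself on key "a", keep the two "b" values, and de-duplicate. An inner join suffices in this sketch because every key matches itself; Spark's "fullouter" produces the same rows here.

```python
# Same sample data as the Scala snippet above.
data = [("john", "red"), ("john", "blue"), ("john", "red"),
        ("bill", "blue"), ("bill", "red"), ("sam", "green")]

# Self-join on column "a", select both "b" values, distinct via a set.
joined = {(b1, b2)
          for a1, b1 in data
          for a2, b2 in data
          if a1 == a2}
```

The resulting set matches the five distinct rows in the output above.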
From: Andy Davidson <An...@SantaCruzIntegration.com>
Date: Friday, March 30, 2018 at 8:58 PM
To: Andy Davidson <An...@SantaCruzIntegration.com>, user <us...@spark.apache.org>
Subject: Re: how to create all possible combinations from an array? how to join and explode row array?
Re: how to create all possible combinations from an array? how to
join and explode row array?
Posted by Yong Zhang <ja...@hotmail.com>.
What's wrong with just using a UDF that does a for loop in Scala? You can change the for-loop logic to produce whatever combinations you want.
scala> spark.version
res4: String = 2.2.1
scala> aggDS.printSchema
root
|-- name: string (nullable = true)
|-- colors: array (nullable = true)
| |-- element: string (containsNull = true)
scala> aggDS.show(false)
+----+----------------+
|name|colors |
+----+----------------+
|john|[red, blue, red]|
|bill|[blue, red] |
|sam |[gree] |
+----+----------------+
scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf
scala> val loopUDF = udf { x: Seq[String] => for (a <- x; b <-x) yield (a,b) }
loopUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(ArrayType(StringType,true))))
scala> aggDS.withColumn("newCol", loopUDF($"colors")).show(false)
+----+----------------+---------------------------------------------------------------------------------------------------------+
|name|colors |newCol |
+----+----------------+---------------------------------------------------------------------------------------------------------+
|john|[red, blue, red]|[[red,red], [red,blue], [red,red], [blue,red], [blue,blue], [blue,red], [red,red], [red,blue], [red,red]]|
|bill|[blue, red] |[[blue,blue], [blue,red], [red,blue], [red,red]] |
|sam |[gree] |[[gree,gree]] |
+----+----------------+---------------------------------------------------------------------------------------------------------+
Yong
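The loop inside the UDF above translates directly to Python; here is a hedged sketch of just that per-row logic (the `loop_pairs` name is mine, and this is the plain function body one would wrap in a PySpark udf, not a full Spark example):

```python
def loop_pairs(colors):
    """Given one row's array, yield every ordered (a, b) combination.

    Change the comprehension's condition to keep a different subset,
    e.g. add `if a != b` to drop self-pairs.
    """
    return [(a, b) for a in colors for b in colors]

loop_pairs(["blue", "red"])
# -> [("blue", "blue"), ("blue", "red"), ("red", "blue"), ("red", "red")]
```

This reproduces the newCol value shown for the bill row above.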
Re: how to create all possible combinations from an array? how to
join and explode row array?
Posted by Andy Davidson <An...@SantaCruzIntegration.com>.
I was a little sloppy when I created the sample output. It's missing a few
pairs.
Assume for a given row I have [a, b, c]; I want to create something like the
Cartesian product.