Posted to user@spark.apache.org by kant kodali <ka...@gmail.com> on 2017/05/17 23:06:11 UTC
How to flatten struct into a dataframe?
Hi,
I have the schema shown below, and I am trying to load it into a
DataFrame or Dataset such that each field inside the struct becomes its
own column.
I tried to follow this link
<http://stackoverflow.com/questions/38753898/how-to-flatten-a-struct-in-a-spark-dataframe>
and did the following.
Dataset<Row> df = ds.select(
    functions.from_json(new Column("value").cast("string"), getSchema()).as("payload"));
Dataset<Row> df1 = df.select(df.col("payload.info"));
df1.printSchema();
root
 |-- info: struct (nullable = true)
 |    |-- index: string (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- number: integer (nullable = true)
However, I get the following:
+--------------------+
|                info|
+--------------------+
|[,mango,,fruit...|
|[,apple,,fruit...|
I just want the data frame in the format below. Any ideas?
index | type | id | name | number
Thanks!
Re: How to flatten struct into a dataframe?
Posted by kant kodali <ka...@gmail.com>.
Bookmarked that blog post! It answers a lot of my questions.
On Wed, May 17, 2017 at 4:25 PM, Michael Armbrust <mi...@databricks.com>
wrote:
> This blog post walks through ways to manipulate complex data:
> <https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html>
>
> To flatten you can run df.selectExpr("payload.info.*")
Re: How to flatten struct into a dataframe?
Posted by Michael Armbrust <mi...@databricks.com>.
This blog post walks through ways to manipulate complex data:
<https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html>
To flatten, you can run df.selectExpr("payload.info.*"), which expands each
field of the info struct into its own top-level column.
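For anyone unsure what the ".*" in that expression asks for, here is a minimal plain-Java sketch of the idea. It is not Spark code and not how Spark implements it: rows are modelled as nested maps, and the class StructFlattenSketch, its methods, and the sample values are all hypothetical names made up for illustration. In particular, the value for "number" is invented, since the output in the question is truncated.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of "struct.*" expansion. Rows are nested maps, not Spark Rows.
class StructFlattenSketch {

    // Follows a dotted path such as "payload.info" through nested maps and
    // returns the struct found there, or null if any step is missing.
    @SuppressWarnings("unchecked")
    static Map<String, Object> resolveStruct(Map<String, Object> row, String path) {
        Object current = row;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<String, Object>) current).get(part);
        }
        return current instanceof Map ? (Map<String, Object>) current : null;
    }

    // Models "path.*": each field of the struct becomes a top-level column,
    // kept in the struct's own field order. A missing struct yields no columns.
    static Map<String, Object> expandStar(Map<String, Object> row, String path) {
        Map<String, Object> struct = resolveStruct(row, path);
        Map<String, Object> flat = new LinkedHashMap<>();
        if (struct != null) {
            flat.putAll(struct);
        }
        return flat;
    }

    // Sample row shaped like the schema in the question.
    static Map<String, Object> sampleRow() {
        Map<String, Object> info = new LinkedHashMap<>();
        info.put("index", null);
        info.put("type", "mango");
        info.put("id", null);
        info.put("name", "fruit");
        info.put("number", 1); // hypothetical value, not taken from the thread

        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("info", info);

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("payload", payload);
        return row;
    }

    public static void main(String[] args) {
        // Prints the column names produced by the expansion.
        System.out.println(expandStar(sampleRow(), "payload.info").keySet());
    }
}
```

Running main prints [index, type, id, name, number], which is the column layout asked for in the original question. Selecting "payload.info" without the star corresponds to returning the inner map as a single value, which is why the first attempt showed one "info" column.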