You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Kristoffer Sjögren <st...@gmail.com> on 2016/11/18 14:32:01 UTC

DataFrame select non-existing column

Hi

We have evolved a DataFrame by adding a few columns but cannot write
select statements on these columns for older data that doesn't have
them since they fail with a AnalysisException with message "No such
struct field".

We also tried dropping columns but this doesn't work for nested columns.

Any non-hacky ways to get around this?

Cheers,
-Kristoffer

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Re: DataFrame select non-existing column

Posted by Muthu Jayakumar <ba...@gmail.com>.

Depending on your use case, 'df.withColumn("my_existing_or_new_col",
lit(0l))' could work?

On Fri, Nov 18, 2016 at 11:18 AM, Kristoffer Sjögren <st...@gmail.com>
wrote:

> Thanks for your answer. I have been searching the API for doing that
> but I could not find how to do it?
>
> Could you give me a code snippet?
>
> On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf
> <As...@rsa.com> wrote:
> > You can always add the columns to old dataframes giving them null (or
> some literal) as a preprocessing.
> >
> > -----Original Message-----
> > From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
> > Sent: Friday, November 18, 2016 4:32 PM
> > To: user
> > Subject: DataFrame select non-existing column
> >
> > Hi
> >
> > We have evolved a DataFrame by adding a few columns but cannot write
> select statements on these columns for older data that doesn't have them
> since they fail with a AnalysisException with message "No such struct
> field".
> >
> > We also tried dropping columns but this doesn't work for nested columns.
> >
> > Any non-hacky ways to get around this?
> >
> > Cheers,
> > -Kristoffer
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

RE: DataFrame select non-existing column

Posted by "Mendelson, Assaf" <As...@rsa.com>.

The nested columns are in fact a syntactic sugar.
You basically have a column called pass. The type of this column is a struct which has a field called mobile.
After you read the parquet file you can check the schema (df.schema) and looking at what it has. Basically loop through the types and see if you have a pass column. Then take the type of that and check if it has a mobile element. This would be something like this (writing it without testing so it probably will have small mistakes):

df_schema = df.schema
found_pass = False
found_mobile = False
for f in df_schema:
    if f.name == 'pass':
        found_pass = True
        for g in f.dataType:
             if g.name == 'mobile':
                 found_mobile = True
        break

Assaf
-----Original Message-----
From: Kristoffer Sjögren [mailto:stoffe@gmail.com] 
Sent: Sunday, November 20, 2016 4:13 PM
To: Mendelson, Assaf
Cc: user
Subject: Re: DataFrame select non-existing column

The problem is that I do not know which data frames has the pass.mobile column. I just list a HDFS directory which contain the parquet files and some files has the column and some don't. I really don't want to have conditional logic that inspect the schema. But maybe that's the only option?

Maybe I misunderstand you, but the following code fails with the same error as before.

DataFrame dataFrame = ctx.read().parquet(localPath)
.select("pass")
.withColumn("mobile", col("pass.mobile"));


The flatten option works for my use case. But the problem is that there seems to be no way of dropping nested columns, i.e.
drop("pass.auction")


On Sun, Nov 20, 2016 at 10:55 AM, Mendelson, Assaf <As...@rsa.com> wrote:
> The issue is that you already have a struct called pass. What you did was add a new columned called "pass.mobile" instead of adding the element to pass - The schema for pass element is the same as before.
> When you do select pass.mobile, it finds the pass structure and checks for mobile in it.
>
> You can do it the other way around: set the name to be: pass_mobile. Add it as before with lit(0) for those that dataframes that do not have the mobile field and do something like withColumn("pass_mobile", df["pass.modile"]) for those that do.
> Another option is to use do something like df.select("pass.*") to flatten the pass structure and work on that (then you can do withColumn("mobile",...) instead of "pass.mobile") but this would change the schema.
>
>
> -----Original Message-----
> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
> Sent: Saturday, November 19, 2016 4:57 PM
> To: Mendelson, Assaf
> Cc: user
> Subject: Re: DataFrame select non-existing column
>
> Thanks. Here's my code example [1] and the printSchema() output [2].
>
> This code still fails with the following message: "No such struct field mobile in auction, geo"
>
> By looking at the schema, it seems that pass.mobile did not get nested, which is the way it needs to be for my use case. Is nested columns not supported by withColumn()?
>
> [1]
>
> DataFrame df = ctx.read().parquet(localPath).withColumn("pass.mobile", 
> lit(0L)); dataFrame.printSchema(); dataFrame.select("pass.mobile");
>
> [2]
>
> root
>  |-- pass: struct (nullable = true)
>  |    |-- auction: struct (nullable = true)
>  |    |    |-- id: integer (nullable = true)
>  |    |-- geo: struct (nullable = true)
>  |    |    |-- postalCode: string (nullable = true)
>  |-- pass.mobile: long (nullable = false)
>
> On Sat, Nov 19, 2016 at 7:45 AM, Mendelson, Assaf <As...@rsa.com> wrote:
>> In pyspark for example you would do something like:
>>
>> df.withColumn("newColName",pyspark.sql.functions.lit(None))
>>
>> Assaf.
>> -----Original Message-----
>> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
>> Sent: Friday, November 18, 2016 9:19 PM
>> To: Mendelson, Assaf
>> Cc: user
>> Subject: Re: DataFrame select non-existing column
>>
>> Thanks for your answer. I have been searching the API for doing that but I could not find how to do it?
>>
>> Could you give me a code snippet?
>>
>> On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf <As...@rsa.com> wrote:
>>> You can always add the columns to old dataframes giving them null (or some literal) as a preprocessing.
>>>
>>> -----Original Message-----
>>> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
>>> Sent: Friday, November 18, 2016 4:32 PM
>>> To: user
>>> Subject: DataFrame select non-existing column
>>>
>>> Hi
>>>
>>> We have evolved a DataFrame by adding a few columns but cannot write select statements on these columns for older data that doesn't have them since they fail with a AnalysisException with message "No such struct field".
>>>
>>> We also tried dropping columns but this doesn't work for nested columns.
>>>
>>> Any non-hacky ways to get around this?
>>>
>>> Cheers,
>>> -Kristoffer
>>>
>>> --------------------------------------------------------------------
>>> - To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>

Re: DataFrame select non-existing column

Posted by Kristoffer Sjögren <st...@gmail.com>.

The problem is that I do not know which data frames has the
pass.mobile column. I just list a HDFS directory which contain the
parquet files and some files has the column and some don't. I really
don't want to have conditional logic that inspect the schema. But
maybe that's the only option?

Maybe I misunderstand you, but the following code fails with the same
error as before.

DataFrame dataFrame = ctx.read().parquet(localPath)
.select("pass")
.withColumn("mobile", col("pass.mobile"));


The flatten option works for my use case. But the problem is that
there seems to be no way of dropping nested columns, i.e.
drop("pass.auction")


On Sun, Nov 20, 2016 at 10:55 AM, Mendelson, Assaf
<As...@rsa.com> wrote:
> The issue is that you already have a struct called pass. What you did was add a new columned called "pass.mobile" instead of adding the element to pass - The schema for pass element is the same as before.
> When you do select pass.mobile, it finds the pass structure and checks for mobile in it.
>
> You can do it the other way around: set the name to be: pass_mobile. Add it as before with lit(0) for those that dataframes that do not have the mobile field and do something like withColumn("pass_mobile", df["pass.modile"]) for those that do.
> Another option is to use do something like df.select("pass.*") to flatten the pass structure and work on that (then you can do withColumn("mobile",...) instead of "pass.mobile") but this would change the schema.
>
>
> -----Original Message-----
> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
> Sent: Saturday, November 19, 2016 4:57 PM
> To: Mendelson, Assaf
> Cc: user
> Subject: Re: DataFrame select non-existing column
>
> Thanks. Here's my code example [1] and the printSchema() output [2].
>
> This code still fails with the following message: "No such struct field mobile in auction, geo"
>
> By looking at the schema, it seems that pass.mobile did not get nested, which is the way it needs to be for my use case. Is nested columns not supported by withColumn()?
>
> [1]
>
> DataFrame df = ctx.read().parquet(localPath).withColumn("pass.mobile", lit(0L)); dataFrame.printSchema(); dataFrame.select("pass.mobile");
>
> [2]
>
> root
>  |-- pass: struct (nullable = true)
>  |    |-- auction: struct (nullable = true)
>  |    |    |-- id: integer (nullable = true)
>  |    |-- geo: struct (nullable = true)
>  |    |    |-- postalCode: string (nullable = true)
>  |-- pass.mobile: long (nullable = false)
>
> On Sat, Nov 19, 2016 at 7:45 AM, Mendelson, Assaf <As...@rsa.com> wrote:
>> In pyspark for example you would do something like:
>>
>> df.withColumn("newColName",pyspark.sql.functions.lit(None))
>>
>> Assaf.
>> -----Original Message-----
>> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
>> Sent: Friday, November 18, 2016 9:19 PM
>> To: Mendelson, Assaf
>> Cc: user
>> Subject: Re: DataFrame select non-existing column
>>
>> Thanks for your answer. I have been searching the API for doing that but I could not find how to do it?
>>
>> Could you give me a code snippet?
>>
>> On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf <As...@rsa.com> wrote:
>>> You can always add the columns to old dataframes giving them null (or some literal) as a preprocessing.
>>>
>>> -----Original Message-----
>>> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
>>> Sent: Friday, November 18, 2016 4:32 PM
>>> To: user
>>> Subject: DataFrame select non-existing column
>>>
>>> Hi
>>>
>>> We have evolved a DataFrame by adding a few columns but cannot write select statements on these columns for older data that doesn't have them since they fail with a AnalysisException with message "No such struct field".
>>>
>>> We also tried dropping columns but this doesn't work for nested columns.
>>>
>>> Any non-hacky ways to get around this?
>>>
>>> Cheers,
>>> -Kristoffer
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

RE: DataFrame select non-existing column

Posted by "Mendelson, Assaf" <As...@rsa.com>.

The issue is that you already have a struct called pass. What you did was add a new columned called "pass.mobile" instead of adding the element to pass - The schema for pass element is the same as before.
When you do select pass.mobile, it finds the pass structure and checks for mobile in it.

You can do it the other way around: set the name to be: pass_mobile. Add it as before with lit(0) for those that dataframes that do not have the mobile field and do something like withColumn("pass_mobile", df["pass.modile"]) for those that do.
Another option is to use do something like df.select("pass.*") to flatten the pass structure and work on that (then you can do withColumn("mobile",...) instead of "pass.mobile") but this would change the schema.

-----Original Message-----
From: Kristoffer Sjögren [mailto:stoffe@gmail.com] 
Sent: Saturday, November 19, 2016 4:57 PM
To: Mendelson, Assaf
Cc: user
Subject: Re: DataFrame select non-existing column

Thanks. Here's my code example [1] and the printSchema() output [2].

This code still fails with the following message: "No such struct field mobile in auction, geo"

By looking at the schema, it seems that pass.mobile did not get nested, which is the way it needs to be for my use case. Is nested columns not supported by withColumn()?

[1]

DataFrame df = ctx.read().parquet(localPath).withColumn("pass.mobile", lit(0L)); dataFrame.printSchema(); dataFrame.select("pass.mobile");

[2]

root
 |-- pass: struct (nullable = true)
 |    |-- auction: struct (nullable = true)
 |    |    |-- id: integer (nullable = true)
 |    |-- geo: struct (nullable = true)
 |    |    |-- postalCode: string (nullable = true)
 |-- pass.mobile: long (nullable = false)

On Sat, Nov 19, 2016 at 7:45 AM, Mendelson, Assaf <As...@rsa.com> wrote:
> In pyspark for example you would do something like:
>
> df.withColumn("newColName",pyspark.sql.functions.lit(None))
>
> Assaf.
> -----Original Message-----
> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
> Sent: Friday, November 18, 2016 9:19 PM
> To: Mendelson, Assaf
> Cc: user
> Subject: Re: DataFrame select non-existing column
>
> Thanks for your answer. I have been searching the API for doing that but I could not find how to do it?
>
> Could you give me a code snippet?
>
> On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf <As...@rsa.com> wrote:
>> You can always add the columns to old dataframes giving them null (or some literal) as a preprocessing.
>>
>> -----Original Message-----
>> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
>> Sent: Friday, November 18, 2016 4:32 PM
>> To: user
>> Subject: DataFrame select non-existing column
>>
>> Hi
>>
>> We have evolved a DataFrame by adding a few columns but cannot write select statements on these columns for older data that doesn't have them since they fail with a AnalysisException with message "No such struct field".
>>
>> We also tried dropping columns but this doesn't work for nested columns.
>>
>> Any non-hacky ways to get around this?
>>
>> Cheers,
>> -Kristoffer
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>

Re: DataFrame select non-existing column

Posted by Kristoffer Sjögren <st...@gmail.com>.

Thanks. Here's my code example [1] and the printSchema() output [2].

This code still fails with the following message: "No such struct
field mobile in auction, geo"

By looking at the schema, it seems that pass.mobile did not get
nested, which is the way it needs to be for my use case. Is nested
columns not supported by withColumn()?

[1]

DataFrame df = ctx.read().parquet(localPath).withColumn("pass.mobile", lit(0L));
dataFrame.printSchema();
dataFrame.select("pass.mobile");

[2]

root
 |-- pass: struct (nullable = true)
 |    |-- auction: struct (nullable = true)
 |    |    |-- id: integer (nullable = true)
 |    |-- geo: struct (nullable = true)
 |    |    |-- postalCode: string (nullable = true)
 |-- pass.mobile: long (nullable = false)

On Sat, Nov 19, 2016 at 7:45 AM, Mendelson, Assaf
<As...@rsa.com> wrote:
> In pyspark for example you would do something like:
>
> df.withColumn("newColName",pyspark.sql.functions.lit(None))
>
> Assaf.
> -----Original Message-----
> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
> Sent: Friday, November 18, 2016 9:19 PM
> To: Mendelson, Assaf
> Cc: user
> Subject: Re: DataFrame select non-existing column
>
> Thanks for your answer. I have been searching the API for doing that but I could not find how to do it?
>
> Could you give me a code snippet?
>
> On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf <As...@rsa.com> wrote:
>> You can always add the columns to old dataframes giving them null (or some literal) as a preprocessing.
>>
>> -----Original Message-----
>> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
>> Sent: Friday, November 18, 2016 4:32 PM
>> To: user
>> Subject: DataFrame select non-existing column
>>
>> Hi
>>
>> We have evolved a DataFrame by adding a few columns but cannot write select statements on these columns for older data that doesn't have them since they fail with a AnalysisException with message "No such struct field".
>>
>> We also tried dropping columns but this doesn't work for nested columns.
>>
>> Any non-hacky ways to get around this?
>>
>> Cheers,
>> -Kristoffer
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

RE: DataFrame select non-existing column

Posted by "Mendelson, Assaf" <As...@rsa.com>.

In pyspark for example you would do something like:

df.withColumn("newColName",pyspark.sql.functions.lit(None))

Assaf.
-----Original Message-----
From: Kristoffer Sjögren [mailto:stoffe@gmail.com] 
Sent: Friday, November 18, 2016 9:19 PM
To: Mendelson, Assaf
Cc: user
Subject: Re: DataFrame select non-existing column

Thanks for your answer. I have been searching the API for doing that but I could not find how to do it?

Could you give me a code snippet?

On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf <As...@rsa.com> wrote:
> You can always add the columns to old dataframes giving them null (or some literal) as a preprocessing.
>
> -----Original Message-----
> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
> Sent: Friday, November 18, 2016 4:32 PM
> To: user
> Subject: DataFrame select non-existing column
>
> Hi
>
> We have evolved a DataFrame by adding a few columns but cannot write select statements on these columns for older data that doesn't have them since they fail with a AnalysisException with message "No such struct field".
>
> We also tried dropping columns but this doesn't work for nested columns.
>
> Any non-hacky ways to get around this?
>
> Cheers,
> -Kristoffer
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

Re: DataFrame select non-existing column

Posted by Kristoffer Sjögren <st...@gmail.com>.

Thanks for your answer. I have been searching the API for doing that
but I could not find how to do it?

Could you give me a code snippet?

On Fri, Nov 18, 2016 at 8:03 PM, Mendelson, Assaf
<As...@rsa.com> wrote:
> You can always add the columns to old dataframes giving them null (or some literal) as a preprocessing.
>
> -----Original Message-----
> From: Kristoffer Sjögren [mailto:stoffe@gmail.com]
> Sent: Friday, November 18, 2016 4:32 PM
> To: user
> Subject: DataFrame select non-existing column
>
> Hi
>
> We have evolved a DataFrame by adding a few columns but cannot write select statements on these columns for older data that doesn't have them since they fail with a AnalysisException with message "No such struct field".
>
> We also tried dropping columns but this doesn't work for nested columns.
>
> Any non-hacky ways to get around this?
>
> Cheers,
> -Kristoffer
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

RE: DataFrame select non-existing column

Posted by "Mendelson, Assaf" <As...@rsa.com>.

You can always add the columns to old dataframes giving them null (or some literal) as a preprocessing.

-----Original Message-----
From: Kristoffer Sjögren [mailto:stoffe@gmail.com] 
Sent: Friday, November 18, 2016 4:32 PM
To: user
Subject: DataFrame select non-existing column

Hi

We have evolved a DataFrame by adding a few columns but cannot write select statements on these columns for older data that doesn't have them since they fail with a AnalysisException with message "No such struct field".

We also tried dropping columns but this doesn't work for nested columns.

Any non-hacky ways to get around this?

Cheers,
-Kristoffer

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org