You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Mike Trienis <mi...@orcsol.com> on 2015/07/17 20:55:25 UTC

Data frames select and where clause dependency

I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine

   - df.select("field1", "filter_field").filter(df("filter_field") ===
   "value").show()

However, the next one fails with the error "in operator !Filter
(filter_field#60 = value);"

   - df.select("field1").filter(df("filter_field") === "value").show()

As a work-around, it seems that I can do the following

   - df.select("field1", "filter_field").filter(df("filter_field") ===
   "value").drop("filter_field").show()


Thanks, Mike.

Re: Data frames select and where clause dependency

Posted by Mike Trienis <mi...@orcsol.com>.
Definitely, thanks Mohammed.

On Mon, Jul 20, 2015 at 5:47 PM, Mohammed Guller <mo...@glassbeam.com>
wrote:

>  Thanks, Harish.
>
>
>
> Mike – this would be a cleaner version for your use case:
>
> df.filter(df("filter_field") === "value").select("field1").show()
>
>
>
> Mohammed
>
>
>
> *From:* Harish Butani [mailto:rhbutani.spark@gmail.com]
> *Sent:* Monday, July 20, 2015 5:37 PM
> *To:* Mohammed Guller
> *Cc:* Michael Armbrust; Mike Trienis; user@spark.apache.org
>
> *Subject:* Re: Data frames select and where clause dependency
>
>
>
> Yes via:  org.apache.spark.sql.catalyst.optimizer.ColumnPruning
>
> See DefaultOptimizer.batches for list of logical rewrites.
>
>
>
> You can see the optimized plan by printing: df.queryExecution.optimizedPlan
>
>
>
> On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller <mo...@glassbeam.com>
> wrote:
>
> Michael,
>
> How would the Catalyst optimizer optimize this version?
>
> df.filter(df("filter_field") === "value").select("field1").show()
>
> Would it still read all the columns in df or would it read only
> “filter_field” and “field1” since only two columns are used (assuming other
> columns from df are not used anywhere else)?
>
>
>
> Mohammed
>
>
>
> *From:* Michael Armbrust [mailto:michael@databricks.com]
> *Sent:* Friday, July 17, 2015 1:39 PM
> *To:* Mike Trienis
> *Cc:* user@spark.apache.org
> *Subject:* Re: Data frames select and where clause dependency
>
>
>
> Each operation on a dataframe is completely independent and doesn't know
> what operations happened before it.  When you do a selection, you are
> removing other columns from the dataframe and so the filter has nothing to
> operate on.
>
>
>
> On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis <mi...@orcsol.com>
> wrote:
>
> I'd like to understand why the where field must exist in the select
> clause.
>
>
>
> For example, the following select statement works fine
>
>    - df.select("field1", "filter_field").filter(df("filter_field") ===
>    "value").show()
>
>  However, the next one fails with the error "in operator !Filter
> (filter_field#60 = value);"
>
>    - df.select("field1").filter(df("filter_field") === "value").show()
>
>  As a work-around, it seems that I can do the following
>
>    - df.select("field1", "filter_field").filter(df("filter_field") ===
>    "value").drop("filter_field").show()
>
>
>
> Thanks, Mike.
>
>
>
>
>

RE: Data frames select and where clause dependency

Posted by Mohammed Guller <mo...@glassbeam.com>.
Thanks, Harish.

Mike – this would be a cleaner version for your use case:
df.filter(df("filter_field") === "value").select("field1").show()

Mohammed

From: Harish Butani [mailto:rhbutani.spark@gmail.com]
Sent: Monday, July 20, 2015 5:37 PM
To: Mohammed Guller
Cc: Michael Armbrust; Mike Trienis; user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Yes via:  org.apache.spark.sql.catalyst.optimizer.ColumnPruning
See DefaultOptimizer.batches for list of logical rewrites.

You can see the optimized plan by printing: df.queryExecution.optimizedPlan

On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller <mo...@glassbeam.com>> wrote:
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df("filter_field") === "value").select("field1").show()
Would it still read all the columns in df or would it read only “filter_field” and “field1” since only two columns are used (assuming other columns from df are not used anywhere else)?

Mohammed

From: Michael Armbrust [mailto:michael@databricks.com<ma...@databricks.com>]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@spark.apache.org<ma...@spark.apache.org>
Subject: Re: Data frames select and where clause dependency

Each operation on a dataframe is completely independent and doesn't know what operations happened before it.  When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis <mi...@orcsol.com>> wrote:
I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine

  *   df.select("field1", "filter_field").filter(df("filter_field") === "value").show()
However, the next one fails with the error "in operator !Filter (filter_field#60 = value);"

  *   df.select("field1").filter(df("filter_field") === "value").show()
As a work-around, it seems that I can do the following

  *   df.select("field1", "filter_field").filter(df("filter_field") === "value").drop("filter_field").show()

Thanks, Mike.



Re: Data frames select and where clause dependency

Posted by Harish Butani <rh...@gmail.com>.
Yes via:  org.apache.spark.sql.catalyst.optimizer.ColumnPruning
See DefaultOptimizer.batches for list of logical rewrites.

You can see the optimized plan by printing: df.queryExecution.optimizedPlan

On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller <mo...@glassbeam.com>
wrote:

>  Michael,
>
> How would the Catalyst optimizer optimize this version?
>
> df.filter(df("filter_field") === "value").select("field1").show()
>
> Would it still read all the columns in df or would it read only
> “filter_field” and “field1” since only two columns are used (assuming other
> columns from df are not used anywhere else)?
>
>
>
> Mohammed
>
>
>
> *From:* Michael Armbrust [mailto:michael@databricks.com]
> *Sent:* Friday, July 17, 2015 1:39 PM
> *To:* Mike Trienis
> *Cc:* user@spark.apache.org
> *Subject:* Re: Data frames select and where clause dependency
>
>
>
> Each operation on a dataframe is completely independent and doesn't know
> what operations happened before it.  When you do a selection, you are
> removing other columns from the dataframe and so the filter has nothing to
> operate on.
>
>
>
> On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis <mi...@orcsol.com>
> wrote:
>
> I'd like to understand why the where field must exist in the select
> clause.
>
>
>
> For example, the following select statement works fine
>
>    - df.select("field1", "filter_field").filter(df("filter_field") ===
>    "value").show()
>
>  However, the next one fails with the error "in operator !Filter
> (filter_field#60 = value);"
>
>    - df.select("field1").filter(df("filter_field") === "value").show()
>
>  As a work-around, it seems that I can do the following
>
>    - df.select("field1", "filter_field").filter(df("filter_field") ===
>    "value").drop("filter_field").show()
>
>
>
> Thanks, Mike.
>
>
>

RE: Data frames select and where clause dependency

Posted by Mohammed Guller <mo...@glassbeam.com>.
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df("filter_field") === "value").select("field1").show()
Would it still read all the columns in df or would it read only “filter_field” and “field1” since only two columns are used (assuming other columns from df are not used anywhere else)?

Mohammed

From: Michael Armbrust [mailto:michael@databricks.com]
Sent: Friday, July 17, 2015 1:39 PM
To: Mike Trienis
Cc: user@spark.apache.org
Subject: Re: Data frames select and where clause dependency

Each operation on a dataframe is completely independent and doesn't know what operations happened before it.  When you do a selection, you are removing other columns from the dataframe and so the filter has nothing to operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis <mi...@orcsol.com>> wrote:
I'd like to understand why the where field must exist in the select clause.

For example, the following select statement works fine

  *   df.select("field1", "filter_field").filter(df("filter_field") === "value").show()
However, the next one fails with the error "in operator !Filter (filter_field#60 = value);"

  *   df.select("field1").filter(df("filter_field") === "value").show()
As a work-around, it seems that I can do the following

  *   df.select("field1", "filter_field").filter(df("filter_field") === "value").drop("filter_field").show()

Thanks, Mike.


Re: Data frames select and where clause dependency

Posted by Michael Armbrust <mi...@databricks.com>.
Each operation on a dataframe is completely independent and doesn't know
what operations happened before it.  When you do a selection, you are
removing other columns from the dataframe and so the filter has nothing to
operate on.

On Fri, Jul 17, 2015 at 11:55 AM, Mike Trienis <mi...@orcsol.com>
wrote:

> I'd like to understand why the where field must exist in the select
> clause.
>
> For example, the following select statement works fine
>
>    - df.select("field1", "filter_field").filter(df("filter_field") ===
>    "value").show()
>
> However, the next one fails with the error "in operator !Filter
> (filter_field#60 = value);"
>
>    - df.select("field1").filter(df("filter_field") === "value").show()
>
> As a work-around, it seems that I can do the following
>
>    - df.select("field1", "filter_field").filter(df("filter_field") ===
>    "value").drop("filter_field").show()
>
>
> Thanks, Mike.
>