Posted to user@spark.apache.org by "Jain, Nishit" <nj...@underarmour.com> on 2016/11/03 15:53:10 UTC

How do I convert a data frame to broadcast variable?

I have a lookup table in a HANA database. I want to create a Spark broadcast variable for it.
What would be the suggested approach? Should I read it as a data frame and convert the data frame into a broadcast variable?

Thanks,
Nishit
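
For what the question describes, one common pattern is to read the table as a DataFrame, collect it to the driver, and broadcast a plain Scala map. This is a sketch only, using the Spark 1.6-era API seen later in the thread; the JDBC URL, credentials, table name, and column positions are all placeholders, and collecting is only safe when the lookup table is small:

```scala
import java.util.Properties

// Placeholder connection details -- not from the thread.
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// Read the HANA lookup table over JDBC as a DataFrame.
val lookupDF = sqlContext.read.jdbc("jdbc:sap://hanahost:30015", "LOOKUP_TABLE", props)

// Collect to the driver (small tables only) and broadcast an ordinary map.
val lookupMap = lookupDF.collect()
  .map(row => row.getString(0) -> row.getString(1))
  .toMap
val lookupBroadcast = sc.broadcast(lookupMap)

// Executors can then do cheap local lookups, e.g.:
// rdd.map(k => lookupBroadcast.value.getOrElse(k, "unknown"))
```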

Re: How do I convert a data frame to broadcast variable?

Posted by "Jain, Nishit" <nj...@underarmour.com>.
Awesome, thanks Silvio!


Re: How do I convert a data frame to broadcast variable?

Posted by Silvio Fiorito <si...@granturing.com>.
Hi Nishit,


Yes, the JDBC connector supports predicate pushdown and column pruning, so any selection you make on the DataFrame will be materialized in the query sent via JDBC.


You should be able to verify this by looking at the physical query plan:


val df = sqlContext.jdbc(....).select($"col1", $"col2")

df.explain(true)


Or if you can easily log queries submitted to your database then you can view the specific query.


Thanks,

Silvio
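
To make this check concrete, here is a fuller sketch. The URL, table, and column names are placeholders, and `sqlContext.read.jdbc` is the non-deprecated spelling of the same call in Spark 1.4+:

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")      // placeholder credentials
props.setProperty("password", "secret")

val df = sqlContext.read
  .jdbc("jdbc:sap://hanahost:30015", "LOOKUP_TABLE", props)
  .select($"col1", $"col2")
  .filter($"col1" === "x")

// The physical plan should show that only col1 and col2 are requested and
// that the filter was pushed to the JDBC source (exact wording varies by
// Spark version, e.g. a "PushedFilters: [...]" entry on the scan node).
df.explain(true)
```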


Re: How do I convert a data frame to broadcast variable?

Posted by "Jain, Nishit" <nj...@underarmour.com>.
Thanks Denny! That does help. I will give that a shot.

Question: If I am going this route, I am wondering how I can read only a few columns of a table (not the whole table) from JDBC as a data frame.
This function from the data frame reader does not give an option to read only certain columns:
def jdbc(url: String, table: String, predicates: Array[String], connectionProperties: Properties): DataFrame

On the other hand, if I create a JdbcRDD I can specify a select query (instead of the full table):
new JdbcRDD(sc: SparkContext, getConnection: () ⇒ Connection, sql: String, lowerBound: Long, upperBound: Long, numPartitions: Int, mapRow: (ResultSet) ⇒ T = JdbcRDD.resultSetToObjectArray)(implicit arg0: ClassTag[T])

Maybe if I do df.select(col1, col2) on a data frame created from a table, Spark will be smart enough to fetch only those two columns rather than the entire table?
Is there any way to test this?
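
One workaround, not mentioned in the thread but supported by the JDBC data source: the `table` argument can be any expression that is valid in a SQL FROM clause, so a column-restricted subquery (aliased, as most databases require) limits what is fetched. The names below are placeholders, with `jdbcUrl` and `connectionProperties` assumed to be defined as elsewhere:

```scala
// Only col1 and col2 ever leave the database.
val twoCols = sqlContext.read.jdbc(
  jdbcUrl,
  "(SELECT col1, col2 FROM LOOKUP_TABLE) AS t",
  connectionProperties)
```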


From: Denny Lee <de...@gmail.com>>
Date: Thursday, November 3, 2016 at 10:59 AM
To: "Jain, Nishit" <nj...@underarmour.com>>, "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>
Subject: Re: How do I convert a data frame to broadcast variable?

If you're able to read the data in as a DataFrame, perhaps you can use a BroadcastHashJoin so that way you can join to that table presuming its small enough to distributed?  Here's a handy guide on a BroadcastHashJoin: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit <nj...@underarmour.com>> wrote:
I have a lookup table in HANA database. I want to create a spark broadcast variable for it.
What would be the suggested approach? Should I read it as an data frame and convert data frame into broadcast variable?

Thanks,
Nishit

Re: How do I convert a data frame to broadcast variable?

Posted by Denny Lee <de...@gmail.com>.
If you're able to read the data in as a DataFrame, perhaps you can use a
BroadcastHashJoin so that you can join to that table, presuming it's
small enough to distribute?  Here's a handy guide on BroadcastHashJoin:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#04%20SQL,%20DataFrames%20%26%20Datasets/05%20BroadcastHashJoin%20-%20scala.html

HTH!


On Thu, Nov 3, 2016 at 8:53 AM Jain, Nishit <nj...@underarmour.com> wrote:

> I have a lookup table in a HANA database. I want to create a Spark broadcast
> variable for it.
> What would be the suggested approach? Should I read it as a data frame
> and convert the data frame into a broadcast variable?
>
> Thanks,
> Nishit
>
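
The broadcast-join suggestion above can be sketched as follows. `factDF`, `lookupDF`, and the join key are hypothetical names; `broadcast` is the planner hint from `org.apache.spark.sql.functions`:

```scala
import org.apache.spark.sql.functions.broadcast

// Hint the planner to replicate the small lookup table to every executor,
// turning the join into a BroadcastHashJoin and avoiding a shuffle of factDF.
val joined = factDF.join(broadcast(lookupDF), Seq("lookup_key"))
```

Note that Spark will also choose a broadcast join automatically when it knows the table's size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default).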