You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by "☼ R Nair (रविशंकर नायर)" <ra...@gmail.com> on 2017/12/03 06:39:40 UTC

Question on using pseudo columns in spark jdbc options

Hi all,

I am using a query to fetch data from MYSQL as follows:

var df = spark.read.
format("jdbc").
option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
option("driver" ,"com.mysql.jdbc.Driver").
option("user", "retail_dba").
option("password", "cloudera").
option("dbtable", "orders").
option("partitionColumn", "order_id").
option("lowerBound", "1").
option("upperBound", "68883").
option("numPartitions", "4").
load()

Question is, can I use a pseudo column (like ROWNUM in Oracle or
RRN(employeeno) in DB2) in option where I specify the "partitionColumn" ?
If not, can we specify a partition column which is not a primary key ?

Best,
Ravion

Re: Question on using pseudo columns in spark jdbc options

Posted by "☼ R Nair (रविशंकर नायर)" <ra...@gmail.com>.

It works perfectly. You can use pseudo columns like ROWNUM in Oracle and
RRN in DB2. To avoid skewing you can apply the great coalesce
function...Spark is sparkling..

Best,


On Thu, Dec 7, 2017 at 2:20 PM, Tomasz Dudek <me...@gmail.com>
wrote:

> Hey Ravion,
>
> yes, you can obviously specify other column than a primary key. Be aware
> though, that if the key range is not spread evenly (for example in your
> code, if there's a "gap" in primary keys and no row has id between 0 and
> 17220) some of the executors may not assist in loading data (because
> "SELECT * FROM orders WHERE order_id IS BETWEEN 0 AND 17220 will return an
> empty result). I think you might want to repartition afterwards to ensure
> that df is evenly distributed(<--- could somebody confirm my last sentence?
> I don't want to mislead and I am not sure).
>
> The first question - could you just check and provide us the answer? :)
>
> Cheers,
> Tomasz
>
> 2017-12-03 7:39 GMT+01:00 ☼ R Nair (रविशंकर नायर) <
> ravishankar.nair@gmail.com>:
>
>> Hi all,
>>
>> I am using a query to fetch data from MYSQL as follows:
>>
>> var df = spark.read.
>> format("jdbc").
>> option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
>> option("driver" ,"com.mysql.jdbc.Driver").
>> option("user", "retail_dba").
>> option("password", "cloudera").
>> option("dbtable", "orders").
>> option("partitionColumn", "order_id").
>> option("lowerBound", "1").
>> option("upperBound", "68883").
>> option("numPartitions", "4").
>> load()
>>
>> Question is, can I use a pseudo column (like ROWNUM in Oracle or
>> RRN(employeeno) in DB2) in option where I specify the "partitionColumn" ?
>> If not, can we specify a partition column which is not a primary key ?
>>
>> Best,
>> Ravion
>>
>>
>>
>>
>


--

Re: Question on using pseudo columns in spark jdbc options

Posted by Tomasz Dudek <me...@gmail.com>.

Hey Ravion,

yes, you can obviously specify other column than a primary key. Be aware
though, that if the key range is not spread evenly (for example in your
code, if there's a "gap" in primary keys and no row has id between 0 and
17220) some of the executors may not assist in loading data (because
"SELECT * FROM orders WHERE order_id IS BETWEEN 0 AND 17220 will return an
empty result). I think you might want to repartition afterwards to ensure
that df is evenly distributed(<--- could somebody confirm my last sentence?
I don't want to mislead and I am not sure).

The first question - could you just check and provide us the answer? :)

Cheers,
Tomasz

2017-12-03 7:39 GMT+01:00 ☼ R Nair (रविशंकर नायर) <
ravishankar.nair@gmail.com>:

> Hi all,
>
> I am using a query to fetch data from MYSQL as follows:
>
> var df = spark.read.
> format("jdbc").
> option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
> option("driver" ,"com.mysql.jdbc.Driver").
> option("user", "retail_dba").
> option("password", "cloudera").
> option("dbtable", "orders").
> option("partitionColumn", "order_id").
> option("lowerBound", "1").
> option("upperBound", "68883").
> option("numPartitions", "4").
> load()
>
> Question is, can I use a pseudo column (like ROWNUM in Oracle or
> RRN(employeeno) in DB2) in option where I specify the "partitionColumn" ?
> If not, can we specify a partition column which is not a primary key ?
>
> Best,
> Ravion
>
>
>
>