Posted to user@spark.apache.org by Marek Wiewiorka <ma...@gmail.com> on 2015/03/22 16:44:44 UTC

lower&upperBound not working/spark 1.3

Hi All - I'm trying to use the new SQLContext API to populate a DataFrame
from a JDBC data source, like this:

val jdbcDF = sqlContext.jdbc(
  url = "jdbc:postgresql://localhost:5430/dbname?user=user&password=111",
  table = "se_staging.exp_table3",
  columnName = "cs_id", lowerBound = 1, upperBound = 10000, numPartitions = 12)

No matter how I set the lower and upper bounds, I always get all the rows
from my table.
The API is marked as experimental, so I assume there might be some bugs in
it, but did anybody come across a similar issue?

Thanks!

Re: lower&upperBound not working/spark 1.3

Posted by Sathish Kumaran Vairavelu <vs...@gmail.com>.
Hi

I am also facing the same issue. Is it possible to view the actual query
passed to the database? Has anyone tried that? Also, what if we don't give
the upper and lower partition bounds? Would we end up with data skew?
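On the skew question: the partitioner just cuts [lowerBound, upperBound] into
equal-width strides, so if the column's values cluster, the partitions come out
uneven. A standalone sketch (plain Scala, no Spark, with made-up data) of how
equal strides behave on a skewed column:

```scala
// Made-up skewed column: most ids cluster near the low end of the range.
val ids: Seq[Long] = (1L to 900L) ++ (9000L to 9100L)

// Equal-width strides, as an equal-stride JDBC partitioner would use with
// lowerBound = 1, upperBound = 10000, numPartitions = 4.
val (lower, upper, parts) = (1L, 10000L, 4)
val stride = (upper - lower) / parts // 2499

// Bucket each id by stride; clamp to the last partition.
val sizes: Map[Int, Int] = ids
  .groupBy(id => math.min(parts - 1, ((id - lower) / stride).toInt))
  .map { case (p, xs) => p -> xs.size }

println(sizes.toSeq.sortBy(_._1)) // partition 0 gets ~90% of the rows
```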

Thanks

Sathish
On Sun, Jun 14, 2015 at 5:02 AM Sujeevan <su...@gmail.com> wrote:

> I also thought that it is an issue. After investigating it further, found
> out this. https://issues.apache.org/jira/browse/SPARK-6800
>
> Here is the updated documentation of
> *org.apache.spark.sql.jdbc.JDBCRelation#columnPartition* method
>
>
> "Notice that lowerBound and upperBound are just used to decide the
> partition stride, not for filtering the rows in table. So all rows in the
> table will be partitioned and returned."
>
> So filter has to be added manually in the query.
>
> val jdbcDF = sqlContext.jdbc(url =
> "jdbc:postgresql://localhost:5430/dbname?user=user&password=111", table =
> "(select * from se_staging.exp_table3 where cs_id >= 1 and cs_id <= 10000)
> as exp_table3new" ,columnName="cs_id",lowerBound=1 ,upperBound = 10000,
> numPartitions=12 )
>
>
>
>
> Best Regards,
>
> Sujeevan. N

Re: lower&upperBound not working/spark 1.3

Posted by Sujeevan <su...@gmail.com>.
I also thought it was a bug. After investigating it further, I found
this: https://issues.apache.org/jira/browse/SPARK-6800

Here is the updated documentation of the
*org.apache.spark.sql.jdbc.JDBCRelation#columnPartition* method:


"Notice that lowerBound and upperBound are just used to decide the
partition stride, not for filtering the rows in table. So all rows in the
table will be partitioned and returned."
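In other words, the bounds only pick the split points, and the first and last
partitions are left open-ended, so no row is ever filtered out. A simplified
standalone sketch of that predicate generation (plain Scala; loosely modelled
on columnPartition, not the exact Spark code):

```scala
// Simplified stride logic: the bounds choose the split points, but the outer
// partitions are unbounded, so every row matches some partition's predicate.
def partitionPredicates(col: String, lower: Long, upper: Long, num: Int): Seq[String] = {
  val stride = (upper - lower) / num
  (0 until num).map { i =>
    val lo = lower + i * stride
    val hi = lo + stride
    if (i == 0) s"$col < $hi"             // open at the low end
    else if (i == num - 1) s"$col >= $lo" // open at the high end
    else s"$col >= $lo AND $col < $hi"
  }
}

partitionPredicates("cs_id", 1L, 10000L, 4).foreach(println)
// "cs_id < 2500", then two bounded ranges, then "cs_id >= 7498" --
// rows with cs_id outside 1..10000 still match the first or last clause.
```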

So the filter has to be added manually in the query:

val jdbcDF = sqlContext.jdbc(
  url = "jdbc:postgresql://localhost:5430/dbname?user=user&password=111",
  table = "(select * from se_staging.exp_table3 where cs_id >= 1 and cs_id <= 10000) as exp_table3new",
  columnName = "cs_id", lowerBound = 1, upperBound = 10000, numPartitions = 12)
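If this pattern comes up often, the bounded subquery string can be derived from
the same numbers. A hypothetical helper (just string building; the name and the
alias scheme are made up here):

```scala
// Hypothetical helper: wrap a table in a subquery that restricts the
// partition column to [lower, upper], matching the bounds passed to jdbc().
def boundedTable(table: String, col: String, lower: Long, upper: Long): String = {
  val alias = table.split('.').last + "_bounded"
  s"(select * from $table where $col >= $lower and $col <= $upper) as $alias"
}

println(boundedTable("se_staging.exp_table3", "cs_id", 1L, 10000L))
// (select * from se_staging.exp_table3 where cs_id >= 1 and cs_id <= 10000) as exp_table3_bounded
```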




Best Regards,

Sujeevan. N

On Mon, Mar 23, 2015 at 4:02 AM, Ted Yu <yu...@gmail.com> wrote:

> I went over JDBCRelation#columnPartition() but didn't find obvious clue
> (you can add more logging to confirm that the partitions were generated
> correctly).
>
> Looks like the issue may be somewhere else.
>
> Cheers

Re: lower&upperBound not working/spark 1.3

Posted by Ted Yu <yu...@gmail.com>.
I went over JDBCRelation#columnPartition() but didn't find an obvious clue
(you can add more logging to confirm that the partitions were generated
correctly).

Looks like the issue may be somewhere else.

Cheers

On Sun, Mar 22, 2015 at 12:47 PM, Marek Wiewiorka <marek.wiewiorka@gmail.com
> wrote:

> ...I even tried setting upper/lower bounds to the same value like 1 or 10
> with the same result.
> cs_id is a column of the cardinality ~5*10^6
> So this is not the case here.
>
> Regards,
> Marek

Re: lower&upperBound not working/spark 1.3

Posted by Marek Wiewiorka <ma...@gmail.com>.
...I even tried setting the upper/lower bounds to the same value, like 1 or
10, with the same result.
cs_id is a column with cardinality ~5*10^6,
so this is not the case here.

Regards,
Marek

2015-03-22 20:30 GMT+01:00 Ted Yu <yu...@gmail.com>:

> From javadoc of JDBCRelation#columnPartition():
>    * Given a partitioning schematic (a column of integral type, a number of
>    * partitions, and upper and lower bounds on the column's value),
> generate
>
> In your example, 1 and 10000 are for the value of cs_id column.
>
> Looks like all the values in that column fall within the range of 1 and
> 10000.
>
> Cheers

Re: lower&upperBound not working/spark 1.3

Posted by Ted Yu <yu...@gmail.com>.
From the javadoc of JDBCRelation#columnPartition():
   * Given a partitioning schematic (a column of integral type, a number of
   * partitions, and upper and lower bounds on the column's value), generate

In your example, 1 and 10000 are for the value of cs_id column.

Looks like all the values in that column fall within the range of 1 and
10000.

Cheers

On Sun, Mar 22, 2015 at 8:44 AM, Marek Wiewiorka <ma...@gmail.com>
wrote:

> Hi All - I try to use the new SQLContext API for populating DataFrame from
> jdbc data source.
> like this:
>
> val jdbcDF = sqlContext.jdbc(url =
> "jdbc:postgresql://localhost:5430/dbname?user=user&password=111", table =
> "se_staging.exp_table3" ,columnName="cs_id",lowerBound=1 ,upperBound =
> 10000, numPartitions=12 )
>
> No matter how I set lower and upper bounds I always get all the rows from
> my table.
> The API is marked as experimental so I assume there might be some bugs
> it but
> did anybody come across a similar issue?
>
> Thanks!
>