Posted to user@spark.apache.org by Ramakrishna Rayudu <ra...@gmail.com> on 2022/11/17 16:26:11 UTC

[Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

Hi Team,

I am facing one issue. Can you please help me on this.

We are connecting to Teradata from Spark SQL with the API below:

Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery,
connectionProperties);

When we execute the above logic on a large table with a million rows, we
see the extra query below executing every time, resulting in a performance
hit on the DB.

We got the information below from our DBA; we don't have any logs on Spark SQL.

SELECT 1 FROM ONE_MILLION_ROWS_TABLE;

1
1
1
1
1
1
1
1
1

Can you please clarify why this query is executing, or is there any chance
that this type of query is being executed from our code itself while
checking the row count from the dataframe?

Please share your inputs on this.


Thanks,

Rama

pyspark read.csv() doesn't respect locale when reading float

Posted by "Weiand, Markus" <ma...@bertelsmann.de>.
Hello!

I want to read csv files with pyspark using (spark_session).read.csv().
There is a whole bunch of nice options, especially the option "locale", but nonetheless a decimal comma instead of a decimal point is not understood when reading float/double input, even when the locale is set to 'de-DE'. I am using Spark 3.2.0.
Of course I can read the column as a string and write my own float parser, but this will be inefficient in Python.
And a simple csv generated by Excel will have decimal commas if written in Germany (with German localized Excel).

Markus
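A workaround is to read the column as a string and normalize it before casting. The snippet below is a plain-Python sketch of that normalization only; the helper name and the German-format assumptions ('.' as thousands separator, ',' as decimal separator) are illustrative, not from Spark:

```python
# Plain-Python sketch of normalizing German-formatted decimals before a
# float cast. Format assumptions: '.' = thousands separator,
# ',' = decimal separator.

def parse_german_decimal(text: str) -> float:
    # Drop thousands separators first, then turn the decimal comma
    # into a decimal point that float() understands.
    return float(text.replace(".", "").replace(",", "."))

print(parse_german_decimal("1.234,56"))  # 1234.56
```

In PySpark the same idea can likely be kept JVM-side (avoiding the Python round-trip) by applying `regexp_replace` to strip dots and swap the comma before a `.cast("double")` on the string column.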


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

Posted by Sean Owen <sr...@gmail.com>.
Taking this off-list.

Start here:
https://github.com/apache/spark/blob/70ec696bce7012b25ed6d8acec5e2f3b3e127f11/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L144
Look at subclasses of JdbcDialect too, like TeradataDialect.
Note that you are using an old unsupported version, too; that's a link to
master.
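In rough Python pseudocode, the dialect mechanism at that link looks like this (a simplified paraphrase of the Scala; the query strings follow the patterns discussed in this thread, and the Teradata variant in particular is my assumption, not copied from Spark's source):

```python
# Simplified paraphrase of Spark's JdbcDialect existence-check pattern.
# The real logic is in JdbcDialects.scala (Scala); class names mirror it,
# but the method bodies here are illustrative only.

class JdbcDialect:
    def get_table_exists_query(self, table: str) -> str:
        # Generic fallback: WHERE 1=0 matches no rows, so the probe is cheap.
        return f"SELECT * FROM {table} WHERE 1=0"

class MySQLDialect(JdbcDialect):
    def get_table_exists_query(self, table: str) -> str:
        # MySQL/Postgres-style override: LIMIT caps the result at one row.
        return f"SELECT 1 FROM {table} LIMIT 1"

class TeradataDialect(JdbcDialect):
    def get_table_exists_query(self, table: str) -> str:
        # Teradata has no LIMIT; TOP plays the same role (assumption).
        return f"SELECT TOP 1 1 FROM {table}"

print(MySQLDialect().get_table_exists_query("t"))  # SELECT 1 FROM t LIMIT 1
```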

On Fri, Nov 18, 2022 at 5:50 AM Ramakrishna Rayudu <
ramakrishna560.rayudu@gmail.com> wrote:

> Hi Sean,
>
> Can you please let me know what query Spark internally fires for getting
> a count on a dataframe.
>
> Long count = dataframe.count();
>
> Is it
>
> SELECT 1 FROM (QUERY) SUB_TABL
>
> summing up all the 1s in the response, or directly
>
> SELECT COUNT(*) FROM (QUERY) SUB_TABL
>
> Can you please tell me which approach Spark will follow?
>
>
> Thanks,
> Ramakrishna Rayudu
>
> On Fri, Nov 18, 2022, 8:13 AM Ramakrishna Rayudu <
> ramakrishna560.rayudu@gmail.com> wrote:
>
>> Sure, I will test with the latest Spark and let you know the result.
>>
>> Thanks,
>> Rama
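The two alternatives in the quoted question can be illustrated against any SQL engine (sqlite3 below, purely as a stand-in database; whether Spark actually uses the first form here is exactly what this thread is trying to establish):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])

# Alternative 1: SELECT 1 per row -- the client receives one row of "1"
# for every source row and has to count them itself.
ones = conn.execute("SELECT 1 FROM (SELECT * FROM t) SUB_TABL").fetchall()
client_count = len(ones)

# Alternative 2: COUNT(*) -- the database counts and ships a single row.
(db_count,) = conn.execute(
    "SELECT COUNT(*) FROM (SELECT * FROM t) SUB_TABL").fetchone()

print(client_count, db_count)  # 5 5
```

Both yield the same number, but the first transfers one row per source row, which is the performance concern raised in this thread.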

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

Posted by Sean Owen <sr...@gmail.com>.
Weird, does Teradata not support LIMIT n? Looking at the Spark source code
suggests it doesn't. The syntax is "SELECT TOP"? I wonder if that's why the
generic query that seems to test existence loses the LIMIT.
But, that "SELECT 1" test seems to be used for MySQL, Postgres, so I'm
still not sure where it's coming from or if it's coming from Spark. You're
using the teradata dialect I assume. Can you use the latest Spark to test?

On Thu, Nov 17, 2022 at 11:31 AM Ramakrishna Rayudu <
ramakrishna560.rayudu@gmail.com> wrote:

> Yes, I am sure that we are not generating these kinds of queries. Okay,
> then the problem is that LIMIT is not coming up in the query. Can you
> please suggest a direction?
>
> Thanks,
> Rama

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

Posted by Sean Owen <sr...@gmail.com>.
Hm, the existence queries even in 2.4.x had LIMIT 1. Are you sure nothing
else is generating or changing those queries?


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

Posted by Ramakrishna Rayudu <ra...@gmail.com>.
We are using spark 2.4.4 version.
I can see two types of queries in DB logs.

SELECT 1 FROM (INPUT_QUERY) SPARK_GEN_SUB_0

SELECT * FROM (INPUT_QUERY) SPARK_GEN_SUB_0 WHERE 1=0

The `SELECT *` query ends with `WHERE 1=0`, but the query starting with
`SELECT 1` has no WHERE condition.

Thanks,
Rama
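The asymmetry between those two shapes is easy to reproduce against any SQL engine (sqlite3 below, purely as an illustration of why the first shape hurts on a million-row table): the `WHERE 1=0` form returns zero rows regardless of table size, while the bare `SELECT 1` form returns one row per source row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (x INTEGER)")
conn.executemany("INSERT INTO big VALUES (?)", [(i,) for i in range(1000)])

# Shape 1: no WHERE clause -- one "1" comes back per row (1000 here,
# a million in the case described in this thread).
shape1 = conn.execute(
    "SELECT 1 FROM (SELECT * FROM big) SPARK_GEN_SUB_0").fetchall()

# Shape 2: WHERE 1=0 -- a schema/existence probe, zero rows returned.
shape2 = conn.execute(
    "SELECT * FROM (SELECT * FROM big) SPARK_GEN_SUB_0 WHERE 1=0").fetchall()

print(len(shape1), len(shape2))  # 1000 0
```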


Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

Posted by Sean Owen <sr...@gmail.com>.
Hm, actually that doesn't look like the queries that Spark uses to test
existence, which will be "SELECT 1 ... LIMIT 1" or "SELECT * ... WHERE 1=0"
depending on the dialect. What version, and are you sure something else is
not sending those queries?

On Thu, Nov 17, 2022 at 11:02 AM Ramakrishna Rayudu <
ramakrishna560.rayudu@gmail.com> wrote:

> Hi Sean,
>
> Thanks for your response. I think it has a performance impact: if the
> query returns one million rows, then the response itself will contain
> one million rows unnecessarily, like below.
>
> 1
> 1
> 1
> 1
> .
> .
> 1
>
>
> It impacts performance. Is there any alternative solution for this?
>
> Thanks,
> Rama

Re: [Spark SQL]: Is it possible that spark SQL appends "SELECT 1 " to the query

Posted by Sean Owen <sr...@gmail.com>.
This is a query to check the existence of the table upfront.
It is nearly a no-op query; can it have a perf impact?
