Posted to user@spark.apache.org by Saurabh Gulati <sa...@fedex.com.INVALID> on 2022/02/22 15:33:51 UTC

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Thanks Sean for your response.

@Mich Talebzadeh We run all workloads on GKE as Docker containers. So, to answer your questions: Hive is running in a container as a K8s service, the Spark Thrift Server runs in another container as a service, and Superset in a third container.

We use a Spark-on-GKE setup to run the Thrift Server, which spawns workers depending on the load. For buckets we use GCS.
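For context, the worker spawning is presumably standard Spark dynamic allocation on Kubernetes; a minimal sketch of the relevant spark-defaults.conf entries (the executor cap is illustrative):

    # Scale executors up and down with load
    spark.dynamicAllocation.enabled                    true
    # No external shuffle service on K8s, so track shuffle state on executors
    spark.dynamicAllocation.shuffleTracking.enabled    true
    # Illustrative upper bound
    spark.dynamicAllocation.maxExecutors               20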


TIA
Saurabh
________________________________
From: Mich Talebzadeh <mi...@gmail.com>
Sent: 22 February 2022 16:05
To: Saurabh Gulati <sa...@fedex.com.invalid>
Cc: user@spark.apache.org <us...@spark.apache.org>
Subject: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL


Is your Hive on-prem, with external tables in cloud storage?

Where is your Spark running, and which cloud buckets are you using?

HTH

On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati <sa...@fedex.com.invalid> wrote:
Hello,
We are trying to set up Spark as the execution engine for exposing the data stored in our lake. We have a Hive metastore running alongside the Spark Thrift Server, and we use Superset as the UI.

We save all tables as external tables in the Hive metastore, with storage on cloud buckets.

We see that, right now, when users run a query in Superset SQL Lab it scans the whole table. What we want is to limit the data scanned by setting something like hive.mapred.mode=strict in Spark, so that users get an exception if they don't specify a partition column.

We tried setting spark.hadoop.hive.mapred.mode=strict in spark-defaults.conf on the Thrift Server, but it still scans the whole table.
We also tried setting hive.mapred.mode=strict in hive-defaults.conf for the metastore container.
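For reference, a sketch of the spark-defaults.conf entries we tried; the hive.strict.checks.* property is the finer-grained Hive 2.x+ replacement for strict mode, listed here as an assumption worth testing:

    # Legacy Hive strict mode: reject queries on partitioned tables
    # that have no partition filter
    spark.hadoop.hive.mapred.mode                          strict
    # Hive 2.x+ equivalent (assumption: unclear whether the Thrift
    # Server honours it)
    spark.hadoop.hive.strict.checks.no.partition.filter    true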

We use Spark 3.2 with Hive metastore version 3.1.2.

Is there a way, via Spark settings, to make this happen?


TIA
Saurabh



Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Posted by Saurabh Gulati <sa...@fedex.com.INVALID>.
Hi Gourav,
We use auto-scaling containers in GKE for running the Spark Thrift Server.



Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Are all users using the same Dataproc cluster?

Regards,
Gourav


Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Posted by Saurabh Gulati <sa...@fedex.com.INVALID>.
Thanks for the response, Gourav.

Queries range from simple selects to large joins. We expose the data to our analytics users so that they can develop their models, and they use Superset as the SQL interface for testing.

The Hive metastore will not do a full scan if we specify the partitioning column.
But that's something users might (and do) forget, so we were thinking of enforcing a way to make sure people do specify the partitioning column in their queries.

The only way we see for now is to parse the query in Superset to check whether the partition column is being used. But we are not sure of a way that will work for all types of queries.

For example, we can parse the SQL and check that count(WHERE) == count(partition_column), but this may not work for complex queries.
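A minimal sketch of such a check, using sqlparse (which Superset itself uses for SQL parsing); the table and column names are hypothetical, and nested subqueries are exactly where it falls short:

    import sqlparse
    from sqlparse.sql import Where

    def references_partition_column(sql: str, partition_col: str) -> bool:
        # Naive check: does a top-level WHERE clause mention the partition
        # column? Filters hidden inside subqueries or joins are missed --
        # the "complex queries" caveat above.
        needle = partition_col.lower()
        for statement in sqlparse.parse(sql):
            for token in statement.tokens:
                if isinstance(token, Where) and needle in token.value.lower():
                    return True
        return False

    # e.g. reject in Superset before the query reaches the Thrift Server
    ok = references_partition_column(
        "SELECT * FROM sales WHERE event_date = '2022-02-22'", "event_date")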


Regards
Saurabh



Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I completely agree with Saurabh: using BQ with Spark does not make sense at all if you are trying to cut down your costs, and I think costs do matter to a few people in the end.

Saurabh, is there any chance you can see what actual queries are hitting the Thrift Server? Using the Hive metastore is something I have been doing in AWS EMR for the last five years, and it certainly does not cause a full table scan.
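
One quick way to check whether a given query actually prunes partitions (a sketch; table and column names are hypothetical):

    # Run through the same SparkSession / Thrift Server
    spark.sql(
        "EXPLAIN FORMATTED "
        "SELECT * FROM sales WHERE event_date = '2022-02-22'"
    ).show(truncate=False)
    # In the scan node of the plan, a populated "PartitionFilters: [...]"
    # entry means pruning happened; an empty list means a full scan.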

Hi Sean,
For some reason I am not able to receive any emails from the Spark user group. My account should be a very old one; is there any chance you could kindly have a look and let me know if there is something blocking me? I will be sincerely obliged.

Regards,
Gourav Sengupta


>

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Posted by Mich Talebzadeh <mi...@gmail.com>.
Well, all those parameter settings have no effect because you are only using Hive as metadata storage for your external tables on GCS. Can you give a typical example of your external table creation in Hive, including the partition column?
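i.e. something along these lines (a hypothetical example; table, columns and bucket are made up):

    CREATE EXTERNAL TABLE sales (
        id     BIGINT,
        amount DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'gs://some-bucket/warehouse/sales/';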

HTH




Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Posted by Saurabh Gulati <sa...@fedex.com.INVALID>.
Hey Mich,
We use Spark 3.2 now. We are using BQ but migrating away because:

  *   It's not reflective of our current lake structure, with all the deltas, history tables, model outputs, etc.
  *   It's pretty expensive to load everything into BQ, and essentially it would be a copy of all the data in GCS. External tables in BQ didn't work for us. Currently we store only the latest snapshots in BQ, which breaks idempotency for models that need to time-travel and run in the past.
  *   We might move to a different cloud provider in the future, so we want to be cloud-agnostic.

So we need an execution engine that has the same overview of the data as we have in GCS.
We tried Presto, but performance was similar and Presto didn't support auto-scaling.

TIA
Saurabh



Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Posted by Mich Talebzadeh <mi...@gmail.com>.
Ok, interesting.

I am surprised that you are using Hive rather than BigQuery. My assumption is that your Spark is version 3.1.1 with standard GKE on the auto-scaler. What benefit are you getting from using Hive here? As you have your Hive tables on GCS buckets, could you not easily load them into BigQuery and run Spark against BigQuery?
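
For example, with the spark-bigquery connector on the classpath, reading a BigQuery table back into Spark is roughly (table name hypothetical):

    df = (spark.read.format("bigquery")
          .option("table", "my_project.my_dataset.sales")
          .load())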

HTH


Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Posted by Saurabh Gulati <sa...@fedex.com.INVALID>.
To correct my last message: it's the Hive metastore running as a service in a container, not Hive itself. We use the Spark Thrift Server for query execution.
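
For completeness, clients reach the Thrift Server over JDBC via the HiveServer2 protocol, e.g. with beeline (the hostname is hypothetical; 10000 is the default port):

    beeline -u jdbc:hive2://spark-thriftserver:10000/default -e "SELECT 1"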