Posted to dev@spark.apache.org by "assaf.mendelson" <as...@rsa.com> on 2016/11/15 07:23:39 UTC

separate spark and hive

Hi,
Today, we basically force people to use Hive if they want to get the full use of Spark SQL.
With the default installation, this means that a derby.log file and a metastore_db directory are created in whatever directory we run from.
This becomes a problem when we run multiple scripts from the same working directory.
The solution we employ locally is to always run from a different directory, since we ignore Hive in practice (this of course means we lose the ability to use some of the catalog options in SparkSession).
The only other solution is to create a full-blown Hive installation with proper configuration (probably with a JDBC-backed metastore).

I would propose that in most cases there shouldn't be any Hive use at all. Even for catalog elements such as saving a permanent table, we should be able to configure a target directory and simply write to it (keeping everything file-based to avoid the need for locking). Hive should be reserved for those who actually use it (probably for backward compatibility).

Am I missing something here?
Assaf.
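
For illustration, with the stock pre-built package (which enables Hive support in the shells), something as small as this in bin/pyspark is enough to reproduce the clutter:

    # inside bin/pyspark, where `spark` is the Hive-enabled session created by the shell
    spark.range(10).write.saveAsTable("some_table")   # "some_table" is just a placeholder name

    # a derby.log file and a metastore_db/ directory now sit in the directory the shell
    # was started from, and a second shell started from the same directory cannot lock
    # the same embedded Derby database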




--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/separate-spark-and-hive-tp19879.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: separate spark and hive

Posted by Ricardo Almeida <ri...@actnowib.com>.
Great to know about the "spark.sql.catalogImplementation" configuration property.
I can't find it documented anywhere except in Jacek Laskowski's "Mastering Apache Spark 2.0" GitBook.

I guess we should document it on the Spark Configuration page.

On 15 November 2016 at 11:49, Herman van Hövell tot Westerflier <
hvanhovell@databricks.com> wrote:

> You can start Spark without Hive support by setting the spark.sql.catalogImplementation configuration to in-memory, for example:
>>
>> ./bin/spark-shell --master local[*] --conf spark.sql.catalogImplementation=in-memory
>
>
> I would not change the default from Hive to Spark-only just yet.

Re: separate spark and hive

Posted by Herman van Hövell tot Westerflier <hv...@databricks.com>.
You can start Spark without Hive support by setting the spark.sql.catalogImplementation configuration to in-memory, for example:
>
> ./bin/spark-shell --master local[*] --conf spark.sql.catalogImplementation=in-memory


I would not change the default from Hive to Spark-only just yet.
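
For an application that builds its own session, the same property can presumably be set on the builder before the session is created, e.g. in pyspark (a sketch, not tested):

    from pyspark.sql import SparkSession

    # ask for the in-memory catalog instead of the Hive metastore;
    # this has to happen before the first SparkSession is created
    spark = (SparkSession.builder
             .master("local[*]")
             .config("spark.sql.catalogImplementation", "in-memory")
             .getOrCreate())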

On Tue, Nov 15, 2016 at 9:38 AM, assaf.mendelson <as...@rsa.com>
wrote:

> After looking at the code, I found that spark.sql.catalogImplementation is set to "hive". I would propose that it should be set to "in-memory" by default (or at least that this be documented; the configuration documentation at http://spark.apache.org/docs/latest/configuration.html makes no mention of Hive at all).
>
> Assaf.

RE: separate spark and hive

Posted by "assaf.mendelson" <as...@rsa.com>.
After looking at the code, I found that spark.sql.catalogImplementation is set to "hive". I would propose that it should be set to "in-memory" by default (or at least that this be documented; the configuration documentation at http://spark.apache.org/docs/latest/configuration.html makes no mention of Hive at all).
Assaf.
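
For reference, the active value can be checked from a running session (a sketch; assuming the key is visible through the runtime conf):

    # in bin/pyspark, where the session `spark` already exists
    print(spark.conf.get("spark.sql.catalogImplementation"))   # prints 'hive' with the stock shells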





--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/separate-spark-and-hive-tp19879p19884.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

RE: separate spark and hive

Posted by "assaf.mendelson" <as...@rsa.com>.
The Spark shell (and pyspark) by default create the Spark session with Hive support (this is also true when the session is created using getOrCreate, at least in pyspark).
At a minimum, there should be a way to configure this via spark-defaults.conf.
Assaf.
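
For example, something along these lines in conf/spark-defaults.conf ought to work, assuming the shells honor spark.sql.catalogImplementation when it comes from the defaults file rather than from --conf:

    # conf/spark-defaults.conf (read by spark-submit, spark-shell and pyspark at startup)
    spark.sql.catalogImplementation   in-memory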

From: rxin [via Apache Spark Developers List] [mailto:ml-node+s1001551n19882h31@n3.nabble.com]
Sent: Tuesday, November 15, 2016 9:46 AM
To: Mendelson, Assaf
Subject: Re: separate spark and hive

If you just start a SparkSession without calling enableHiveSupport it actually won't use the Hive catalog support.






--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/separate-spark-and-hive-tp19879p19883.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: separate spark and hive

Posted by Reynold Xin <rx...@databricks.com>.
If you just start a SparkSession without calling enableHiveSupport it
actually won't use the Hive catalog support.
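
In pyspark terms, a minimal sketch of the two cases (only one session can exist per JVM, so the second is shown commented out):

    from pyspark.sql import SparkSession

    # plain session: in-memory catalog, no Derby metastore is created
    spark = SparkSession.builder.appName("no-hive").getOrCreate()

    # opt in to the Hive catalog explicitly:
    # spark = SparkSession.builder.appName("with-hive").enableHiveSupport().getOrCreate()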



RE: separate spark and hive

Posted by "Mendelson, Assaf" <As...@rsa.com>.
The Spark context that is generated by default is actually a Hive context.
I tried to find in the documentation what the differences are between a Hive context and a SQL context, but couldn't find anything for Spark 2.0 (I know that in previous versions a couple of functions, as well as window functions, required a Hive context, but those all seem to have been addressed in Spark 2.0).
Furthermore, I can't seem to find a way to configure Spark not to use Hive. I can only find how to compile it without Hive (and having to build from source each time is not a good idea for a production system).

I would suggest that working without Hive should be either a simple configuration option or even the default, and that any missing functionality should be documented.
Assaf.
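
(For reference, the compile-time switch I mean is the Hive profile in the Maven build; as far as I can tell, Hive support is only bundled when the profile is enabled:)

    ./build/mvn -Phive -Phive-thriftserver -DskipTests clean package   # with Hive support
    ./build/mvn -DskipTests clean package                              # without Hive support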




Re: separate spark and hive

Posted by Reynold Xin <rx...@databricks.com>.
I agree with the high-level idea, and thus SPARK-15691
<https://issues.apache.org/jira/browse/SPARK-15691>.

In reality, it's a huge amount of work to create and maintain a custom catalog. It might actually make sense to do, but it just seems like a lot of work to take on right now, and it would take a toll on interoperability.

If you don't need a persistent catalog, you can just run Spark without Hive mode, can't you?
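
Concretely, the non-persistent workflow would look something like this in pyspark (paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()        # no enableHiveSupport -> in-memory catalog

    df = spark.read.parquet("/path/to/data")          # placeholder path
    df.createOrReplaceTempView("events")              # session-scoped view, nothing is persisted
    spark.sql("SELECT count(*) FROM events").show()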



