Posted to user@spark.apache.org by Haopu Wang <HW...@qilinsoft.com> on 2015/03/10 11:37:34 UTC

[SparkSQL] Reuse HiveContext to different Hive warehouse?

I'm using a Spark 1.3.0 RC3 build with Hive support.

 

In the Spark shell, I want to reuse the same HiveContext instance across
different warehouse locations. Below are the steps for my test (assume I
have already loaded a file into table "src").

 

======

15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive
support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")

======

After these steps, both tables are stored under "/test/w". I expected
"table2" to be stored in the "/test/w2" folder.

 

Another question: if I set "hive.metastore.warehouse.dir" to an HDFS
folder, I cannot use saveAsTable(). Is this by design? The exception stack
trace is below:

======

15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74
java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
        at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
        at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
        at $iwC$$iwC$$iwC.<init>(<console>:33)
        at $iwC$$iwC.<init>(<console>:35)
        at $iwC.<init>(<console>:37)
        at <init>(<console>:39)

 

Thank you very much!

 


Re: [SparkSQL] Reuse HiveContext to different Hive warehouse?

Posted by Michael Armbrust <mi...@databricks.com>.
That val is not really your problem.  In general, there is a lot of global
state throughout the Hive codebase that makes it unsafe to try to connect
to more than one Hive installation from the same JVM.
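
If all you need is to control where the data files for each table end up
(rather than actually talking to two different metastores), one possible
workaround is to pass an explicit path per table instead of changing
hive.metastore.warehouse.dir. A rough, untested sketch, assuming the Spark
1.3 saveAsTable overload that takes a data source name and an options map
(the "parquet" source, the SaveMode, and the "path" option are assumptions
on my part):

import org.apache.spark.sql.SaveMode

// Write the result as a data source table whose files live under an
// explicit location, bypassing the warehouse directory that was fixed
// when the HiveContext was created.
sqlContext.sql("SELECT * FROM src")
  .saveAsTable("table2", "parquet", SaveMode.ErrorIfExists,
    Map("path" -> "/test/w2/table2"))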

On Tue, Mar 10, 2015 at 11:36 PM, Haopu Wang <HW...@qilinsoft.com> wrote:

>  Hao, thanks for the response.
>
>
>
> For Q1, in my case, I have a tool on SparkShell which serves multiple
> users where they can use different Hive installation. I take a look at the
> code of HiveContext. It looks like I cannot do that today because "catalog"
> field cannot be changed after initialize.
>
>
>
>   /* A catalyst metadata catalog that points to the Hive Metastore. */
>
>   @transient
>
>   *override* *protected*[sql] *lazy* *val* catalog = *new*
> HiveMetastoreCatalog(*this*) *with* OverrideCatalog
>
>
>
> For Q2, I check HDFS and it is running as a cluster. I can run the DDL
> from spark shell with HiveContext as well. To reproduce the exception, I
> just run below script. It happens in the last step.
>
>
>
> 15/03/11 14:24:48 INFO SparkILoop: Created sql context (with Hive
> support)..
>
> SQL context available as sqlContext.
>
> scala> sqlContext.sql("SET
> hive.metastore.warehouse.dir=hdfs://server:8020/space/warehouse")
>
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src(key INT, value
> STRING)")
>
> scala> sqlContext.sql("LOAD DATA LOCAL INPATH
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> scala> var output = sqlContext.sql("SELECT key,value FROM src")
>
> scala> output.saveAsTable("outputtable")
>
>
>  ------------------------------
>
> *From:* Cheng, Hao [mailto:hao.cheng@intel.com]
> *Sent:* Wednesday, March 11, 2015 8:25 AM
> *To:* Haopu Wang; user; dev@spark.apache.org
> *Subject:* RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?
>
>
>
> I am not so sure if Hive supports change the metastore after initialized,
> I guess not. Spark SQL totally rely on Hive Metastore in HiveContext,
> probably that’s why it doesn’t work as expected for Q1.
>
>
>
> BTW, in most of cases, people configure the metastore settings in
> hive-site.xml, and will not change that since then, is there any reason
> that you want to change that in runtime?
>
>
>
> For Q2, probably something wrong in configuration, seems the HDFS run into
> the pseudo/single node mode, can you double check that? Or can you run the
> DDL (like create a table) from the spark shell with HiveContext?
>
>
>
> *From:* Haopu Wang [mailto:HWang@qilinsoft.com]
> *Sent:* Tuesday, March 10, 2015 6:38 PM
> *To:* user; dev@spark.apache.org
> *Subject:* [SparkSQL] Reuse HiveContext to different Hive warehouse?
>
>
>
> I'm using Spark 1.3.0 RC3 build with Hive support.
>
>
>
> In Spark Shell, I want to reuse the HiveContext instance to different
> warehouse locations. Below are the steps for my test (Assume I have loaded
> a file into table "src").
>
>
>
> ======
>
> 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive
> support)..
>
> SQL context available as sqlContext.
>
> scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")
>
> scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")
>
> scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")
>
> scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")
>
> ======
>
> After these steps, the tables are stored in "/test/w" only. I expect
> "table2" to be stored in "/test/w2" folder.
>
>
>
> Another question is: if I set "hive.metastore.warehouse.dir" to a HDFS
> folder, I cannot use saveAsTable()? Is this by design? Exception stack
> trace is below:
>
> ======
>
> 15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block
> broadcast_0_piece0
>
> 15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at
> TableReader.scala:74
>
> java.lang.IllegalArgumentException: Wrong FS:
> hdfs://server:8020/space/warehouse/table2, expected: file:///
>
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
>
>         at
> org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)
>
>         at
> org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)
>
>         at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
>
>         at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
>
>         at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
>         at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
>         at scala.collection.immutable.List.foreach(List.scala:318)
>
>         at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
>         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>
>         at
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
>
>         at
> org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)
>
>         at
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
>
>         at
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
>
>         at
> org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
>
>         at
> org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)
>
>         at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
>
>         at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
>
>         at
> org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
>
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
>
>         at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
>
>         at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)
>
>         at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)
>
>         at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)
>
>         at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)
>
>         at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
>
>         at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
>
>         at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
>
>         at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
>
>         at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
>
>         at $iwC$$iwC$$iwC.<init>(<console>:33)
>
>         at $iwC$$iwC.<init>(<console>:35)
>
>         at $iwC.<init>(<console>:37)
>
>         at <init>(<console>:39)
>
>
>
> Thank you very much!
>
>
>

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

Posted by Haopu Wang <HW...@qilinsoft.com>.
Hao, thanks for the response.

 

For Q1, in my case I have a tool built on the Spark shell which serves
multiple users, and they may use different Hive installations. I took a
look at the code of HiveContext. It looks like I cannot do that today
because the "catalog" field cannot be changed after initialization.

 

  /* A catalyst metadata catalog that points to the Hive Metastore. */

  @transient

  override protected[sql] lazy val catalog = new HiveMetastoreCatalog(this) with OverrideCatalog
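
So if I understand it correctly, pointing the shell at a different warehouse
would mean building a brand-new context rather than reconfiguring the
existing one. A rough, untested sketch of what I mean (I'm not sure whether
Hive's internal state even allows two contexts to coexist safely in one JVM):

import org.apache.spark.sql.hive.HiveContext

// A second context whose lazy catalog is built from whatever configuration
// it sees at first use.
val ctx2 = new HiveContext(sc)
ctx2.setConf("hive.metastore.warehouse.dir", "/test/w2")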

 

For Q2, I checked HDFS and it is running as a cluster. I can run DDL from
the Spark shell with HiveContext as well. To reproduce the exception, I
just run the script below. The error happens in the last step.

 

15/03/11 14:24:48 INFO SparkILoop: Created sql context (with Hive support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=hdfs://server:8020/space/warehouse")

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS src(key INT, value STRING)")

scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

scala> var output = sqlContext.sql("SELECT key,value FROM src")

scala> output.saveAsTable("outputtable")

 

________________________________

From: Cheng, Hao [mailto:hao.cheng@intel.com] 
Sent: Wednesday, March 11, 2015 8:25 AM
To: Haopu Wang; user; dev@spark.apache.org
Subject: RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

 

I am not so sure if Hive supports change the metastore after
initialized, I guess not. Spark SQL totally rely on Hive Metastore in
HiveContext, probably that's why it doesn't work as expected for Q1.

 

BTW, in most of cases, people configure the metastore settings in
hive-site.xml, and will not change that since then, is there any reason
that you want to change that in runtime?

 

For Q2, probably something wrong in configuration, seems the HDFS run
into the pseudo/single node mode, can you double check that? Or can you
run the DDL (like create a table) from the spark shell with HiveContext?


 

From: Haopu Wang [mailto:HWang@qilinsoft.com] 
Sent: Tuesday, March 10, 2015 6:38 PM
To: user; dev@spark.apache.org
Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse?

 

I'm using Spark 1.3.0 RC3 build with Hive support.

 

In Spark Shell, I want to reuse the HiveContext instance to different
warehouse locations. Below are the steps for my test (Assume I have
loaded a file into table "src").

 

======

15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive
support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")

======

After these steps, the tables are stored in "/test/w" only. I expect
"table2" to be stored in "/test/w2" folder.

 

Another question is: if I set "hive.metastore.warehouse.dir" to a HDFS
folder, I cannot use saveAsTable()? Is this by design? Exception stack
trace is below:

======

15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74
java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)
        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
        at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
        at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
        at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)
        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
        at $iwC$$iwC$$iwC.<init>(<console>:33)
        at $iwC$$iwC.<init>(<console>:35)
        at $iwC.<init>(<console>:37)
        at <init>(<console>:39)

 

Thank you very much!

 


RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

Posted by "Cheng, Hao" <ha...@intel.com>.
I am not so sure whether Hive supports changing the metastore after it has been initialized; I guess not. Spark SQL relies entirely on the Hive Metastore in HiveContext, and that's probably why it doesn't work as expected for Q1.

BTW, in most cases people configure the metastore settings in hive-site.xml and do not change them afterwards. Is there any reason you want to change them at runtime?

For Q2, there is probably something wrong in the configuration; it seems HDFS is running in pseudo/single-node mode. Can you double-check that? Also, can you run DDL (like creating a table) from the Spark shell with HiveContext?
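
For example, a quick (untested) check from the Spark shell to see which
filesystem is picked up by default; if it prints file:///, the Hadoop
core-site.xml is probably not on the shell's classpath, which would match
the "Wrong FS: ... expected: file:///" error (the property names below are
standard Hadoop keys, not something specific to Spark SQL):

// Inspect the Hadoop configuration the shell is actually using.
val hadoopConf = sc.hadoopConfiguration
println(hadoopConf.get("fs.defaultFS", hadoopConf.get("fs.default.name", "file:///")))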

From: Haopu Wang [mailto:HWang@qilinsoft.com]
Sent: Tuesday, March 10, 2015 6:38 PM
To: user; dev@spark.apache.org
Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse?


I'm using Spark 1.3.0 RC3 build with Hive support.



In Spark Shell, I want to reuse the HiveContext instance to different warehouse locations. Below are the steps for my test (Assume I have loaded a file into table "src").



======

15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")

======

After these steps, the tables are stored in "/test/w" only. I expect "table2" to be stored in "/test/w2" folder.



Another question is: if I set "hive.metastore.warehouse.dir" to a HDFS folder, I cannot use saveAsTable()? Is this by design? Exception stack trace is below:

======

15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0

15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at TableReader.scala:74

java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/space/warehouse/table2, expected: file:///

        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)

        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)

        at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)

        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)

        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)

        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

        at scala.collection.immutable.List.foreach(List.scala:318)

        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

        at scala.collection.AbstractTraversable.map(Traversable.scala:105)

        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)

        at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)

        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)

        at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)

        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)

        at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)

        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)

        at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)

        at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)

        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)

        at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)

        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)

        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)

        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)

        at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)

        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)

        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)

        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)

        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)

        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)

        at $iwC$$iwC$$iwC.<init>(<console>:33)

        at $iwC$$iwC.<init>(<console>:35)

        at $iwC.<init>(<console>:37)

        at <init>(<console>:39)



Thank you very much!


