Posted to user@spark.apache.org by Mithila Joshi <jo...@gmail.com> on 2015/07/23 22:20:28 UTC
Fail to load hive tables through Spark
I am new to Spark and need help figuring out why my Hive databases are
not accessible for a data load through Spark.
Background:
1. I am running Hive, Spark, and my Java program on a single machine. It's
   a Cloudera QuickStart VM, CDH 5.4.x, on VirtualBox.
2. I have downloaded pre-built Spark 1.3.1.
3. I am using the Hive bundled with the VM and can run Hive queries through
   spark-shell and the Hive command line without any issue. This includes
   running the command:

   LOAD DATA INPATH
   'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/'
   INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21');
Problem:
I am writing a Java program to read data from Cassandra and load it into
Hive. I have saved the results of the Cassandra read in parquet format in a
folder called 'result.parquet'.
Now I would like to load this into Hive. For this, I
1. Copied hive-site.xml to the Spark conf folder.
   - I made a change to this XML: I noticed that I had two hive-site.xml
     files, one that was auto-generated and another that had the Hive
     execution parameters. I combined both into a single hive-site.xml.
2. Code used (Java):

   HiveContext hiveContext =
       new HiveContext(JavaSparkContext.toSparkContext(sc));
   hiveContext.sql("show databases").show();
   hiveContext.sql("LOAD DATA INPATH "
       + "'hdfs://quickstart.cloudera:8020/user/cloudera/test_table/result.parquet/' "
       + "INTO TABLE test_spark.test_table PARTITION(part = '2015-08-21')").show();
So, this worked, and I could load data into Hive. Except that after I
restarted my VM, it stopped working.
When I run the "show databases" Hive query, I get a result saying

result
default

instead of the databases in Hive, which are

default
test_spark
I also notice a folder called metastore_db being created in my project
folder. From googling around, I know this happens when Spark can't connect
to the Hive metastore, so it creates one of its own. I thought I had fixed
that, but clearly not.
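[Editor's note: the metastore_db fallback described above is usually avoided by pointing Spark at the shared metastore service. A minimal sketch of the relevant hive-site.xml entry follows, assuming the QuickStart VM's conventional metastore host and port (thrift on 9083) — verify both against your own configuration.]

<!-- Fragment of the hive-site.xml placed in Spark's conf folder.
     Host and port are QuickStart VM defaults and may differ locally. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://quickstart.cloudera:9083</value>
</property>

If this property is missing or the service it names is unreachable, Spark silently falls back to a local embedded Derby metastore, which is exactly what the metastore_db folder indicates.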
What am I doing wrong?
Best,
Mithila
Re: Fail to load hive tables through Spark
Posted by ayan guha <gu...@gmail.com>.
Please check whether your metastore service is running. You may need to
configure the metastore service to start automatically when the VM restarts.
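[Editor's note: on a CDH package install such as the QuickStart VM, checking and persisting the metastore service might look like the following; the service name is the CDH package default and may differ if the VM is managed by Cloudera Manager instead.]

# Check whether the Hive metastore service is currently running.
sudo service hive-metastore status

# Start it for the current session if it is stopped.
sudo service hive-metastore start

# Register it to start automatically on every boot.
sudo chkconfig hive-metastore on

After the service is up, rerunning the "show databases" query from the HiveContext should list test_spark again instead of only the local Derby databases.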
On 24 Jul 2015 06:20, "Mithila Joshi" <jo...@gmail.com> wrote: