Posted to user@hive.apache.org by Chaoyu Tang <ct...@gmail.com> on 2017/08/09 13:52:27 UTC

Re: Populating tables using hive and spark

You can check the property hive.compute.query.using.stats, and try setting
it to false if it is true to see if it helps.
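
For example, a minimal check in the Hive CLI (a sketch: when this property is
true, Hive answers simple aggregates such as count(*) from the numRows
statistic stored in the metastore instead of scanning the files; the table
name comes from the examples further down the thread):

hive> set hive.compute.query.using.stats;        -- prints the current value
hive> set hive.compute.query.using.stats=false;  -- disable for this session
hive> select count(*) from test_orc;             -- now scans the files rather than trusting numRows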

On Mon, Aug 22, 2016 at 7:48 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hm, this looks correct to me:
>
> hive>  select count(1) from testme;
> Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a
> Total jobs = 1
> Launching Job 1 out of 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Spark Job = 72e40552-e157-4954-a3e4-4ef165efe544
> 2016-08-22 12:40:23,307 Stage-6_0: 1/1 Finished Stage-7_0: 1/1 Finished
> Status: Finished successfully in 1.01 seconds
> OK
> 4
> Time taken: 24.663 seconds, Fetched: 1 row(s)
>
> Ok, this is after I ran the compute statistics command.
>
> Let me try to create an empty Hive table first
>
> hive> drop table testme;
> OK
> hive> create table testme (col1 int, col2 string);
> OK
> hive> select count(1) from testme;
> OK
> 0
>
> Insert those two records from Spark
>
> scala> sql("insert into test.testme select * from tmp")
> res13: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from testme").show
> +----+----------+
> |col1|      col2|
> +----+----------+
> |   3|California|
> |   4|     Dehli|
> +----+----------+
>
> So the rows are there.
>
> Let me go to Hive
>
> hive>  select count(1) from testme;
> Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a
> OK
> 2
>
> So it sees the rows.
>
> *I am using Spark 2 and Hive 2*
>
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 22 August 2016 at 12:33, Nitin Kumar <nk...@gmail.com> wrote:
>
>> Hi Mich!
>>
>> There is no problem in displaying records or performing aggregations on
>> them after inserting data from Spark into the Hive table. It is the count
>> query in Hive that returns the wrong result prior to issuing the compute
>> statistics command.
>>
>> On Mon, Aug 22, 2016 at 4:50 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> OK, this is my test:
>>>
>>> 1) create table in Hive and populate it with two rows
>>>
>>> hive> create table testme (col1 int, col2 string);
>>> OK
>>> hive> insert into testme values (1,'London');
>>> Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a
>>> OK
>>> hive> insert into testme values (2,'NY');
>>> Query ID = hduser_20160821212812_2a8384af-23f1-4f28-9395-a99a5f4c1a4a
>>> OK
>>> hive> select * from testme;
>>> OK
>>> 1       London
>>> 2       NY
>>>
>>> So the rows are there.
>>>
>>> Now use Spark to create two more rows:
>>>
>>> scala> case class columns (col1: Int, col2: String)
>>> defined class columns
>>> scala> val df = sc.parallelize(Array((3,"California"),(4,"Dehli"))).map(p
>>> => columns(p._1.toString.toInt, p._2.toString)).toDF()
>>> df: org.apache.spark.sql.DataFrame = [col1: int, col2: string]
>>> scala> df.show
>>> +----+----------+
>>> |col1|      col2|
>>> +----+----------+
>>> |   3|California|
>>> |   4|     Dehli|
>>> +----+----------+
>>>
>>> // register it as tempTable
>>> scala> df.registerTempTable("tmp")
>>> scala> sql("insert into test.testme select * from tmp")
>>> res9: org.apache.spark.sql.DataFrame = []
>>> scala> sql("select * from testme").show
>>> +----+----------+
>>> |col1|      col2|
>>> +----+----------+
>>> |   1|    London|
>>> |   2|        NY|
>>> |   3|California|
>>> |   4|     Dehli|
>>> +----+----------+
>>> So the rows are there.
>>>
>>> Let me go to Hive again now
>>>
>> Try running a count query in hive on testme here... it seems to return a
>> value of 2
>>
>>>
>>> hive>  select * from testme;
>>> OK
>>> 1       London
>>> 2       NY
>>> 3       California
>>> 4       Dehli
>>>
>>> hive> analyze table testme compute statistics for columns;
>>>
>>> So is there any issue here?
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 22 August 2016 at 11:51, Nitin Kumar <nk...@gmail.com>
>>> wrote:
>>>
>>>> Hi Furcy,
>>>>
>>>> If I execute the command "ANALYZE TABLE TEST_ORC COMPUTE STATISTICS"
>>>> before checking the count from hive, Hive returns the correct count,
>>>> although it does not spawn a map-reduce job for computing the count.
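>>>>
>>>> As a sketch of that sequence (refreshing the statistics is what lets Hive
>>>> answer the count from the metastore, which is why no job is spawned):
>>>>
>>>> hive> analyze table test_orc compute statistics;
>>>> hive> select count(*) from test_orc;  -- now correct, served from numRows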
>>>>
>>>> I'm running a HDP 2.4 Cluster with Hive 1.2.1.2.4 and Spark 1.6.1
>>>>
>>>> If others concur, we can go ahead and report it as a bug.
>>>>
>>>> Regards,
>>>> Nitin
>>>>
>>>>
>>>>
>>>> On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin <fu...@flaminem.com>
>>>> wrote:
>>>>
>>>>> Hi Nitin,
>>>>>
>>>>> I confirm that there is something odd here.
>>>>>
>>>>> I did the following test:
>>>>>
>>>>> create table test_orc (id int, name string, dept string) stored as ORC;
>>>>> insert into table test_orc values (1, 'abc', 'xyz');
>>>>> insert into table test_orc values (2, 'def', 'xyz');
>>>>> insert into table test_orc values (3, 'pqr', 'xyz');
>>>>> insert into table test_orc values (4, 'ghi', 'xyz');
>>>>>
>>>>>
>>>>> I ended up with 4 files on hdfs:
>>>>>
>>>>> 000000_0
>>>>> 000000_0_copy_1
>>>>> 000000_0_copy_2
>>>>> 000000_0_copy_3
>>>>>
>>>>>
>>>>> Then I renamed 000000_0_copy_2 to part-00000, and I still got COUNT(*)
>>>>> = 4 with hive.
>>>>> So this is not a file name issue.
>>>>>
>>>>> I then removed one of the files, and I got this :
>>>>>
>>>>> > SELECT COUNT(1) FROM test_orc ;
>>>>> +------+--+
>>>>> | _c0  |
>>>>> +------+--+
>>>>> | 4    |
>>>>> +------+--+
>>>>>
>>>>> > SELECT * FROM test_orc ;
>>>>> +--------------+----------------+----------------+--+
>>>>> | test_orc.id  | test_orc.name  | test_orc.dept  |
>>>>> +--------------+----------------+----------------+--+
>>>>> | 1            | abc            | xyz            |
>>>>> | 2            | def            | xyz            |
>>>>> | 4            | ghi            | xyz            |
>>>>> +--------------+----------------+----------------+--+
>>>>> 3 rows selected (0.162 seconds)
>>>>>
>>>>> So, my guess is that when Hive inserts data, it must keep somewhere in
>>>>> the metastore the number of rows in the table.
>>>>> However, if the files are modified by someone other than Hive itself
>>>>> (either manually or with Spark), you end up with an inconsistency.
>>>>>
>>>>> So I guess we can call it a bug:
>>>>>
>>>>> Hive should detect that the files changed and invalidate its
>>>>> pre-calculated count.
>>>>> Optionally, Spark should be nice to Hive and update the count
>>>>> when inserting.
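>>>>>
>>>>> One way to check this guess (a sketch: numRows is the statistic Hive
>>>>> keeps under Table Parameters in the metastore, and the property name
>>>>> comes from the reply at the top of this thread; not verified on 1.1.0):
>>>>>
>>>>> DESCRIBE FORMATTED test_orc;
>>>>> -- under Table Parameters, numRows should still read 4 even after a
>>>>> -- file has been removed by hand
>>>>>
>>>>> set hive.compute.query.using.stats=false;
>>>>> SELECT COUNT(1) FROM test_orc;
>>>>> -- with stats-based answering disabled, the count should come from a
>>>>> -- scan of the remaining files and return 3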
>>>>>
>>>>> I don't know if this bug has already been reported, and I tested on
>>>>> Hive 1.1.0, so perhaps it is already solved in later releases.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Furcy
>>>>>
>>>>>
>>>>> On Mon, Aug 22, 2016 at 9:34 AM, Nitin Kumar <
>>>>> nk94.nitinkumar@gmail.com> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I've noticed that Hive has problems registering new data records
>>>>>> when the same table is written to using both the Hive terminal and Spark
>>>>>> SQL. The problem is demonstrated by the commands listed below.
>>>>>>
>>>>>> ====================================================================
>>>>>> hive> use default;
>>>>>> hive> create table test_orc (id int, name string, dept string) stored
>>>>>> as ORC;
>>>>>> hive> insert into table test_orc values (1, 'abc', 'xyz');
>>>>>> hive> insert into table test_orc values (2, 'def', 'xyz');
>>>>>> hive> select count(*) from test_orc;
>>>>>> OK
>>>>>> 2
>>>>>> hive> select distinct(name) from test_orc;
>>>>>> OK
>>>>>> abc
>>>>>> def
>>>>>>
>>>>>> *** files in hdfs path in warehouse for the created table ***
>>>>>> [inline screenshot omitted]
>>>>>>
>>>>>> >>> data_points = [(3, 'pqr', 'xyz'), (4, 'ghi', 'xyz')]
>>>>>> >>> column_names = ['identity_id', 'emp_name', 'dept_name']
>>>>>> >>> data_df = sqlContext.createDataFrame(data_points, column_names)
>>>>>> >>> data_df.show()
>>>>>>
>>>>>> +-----------+--------+---------+
>>>>>> |identity_id|emp_name|dept_name|
>>>>>> +-----------+--------+---------+
>>>>>> |          3|     pqr|      xyz|
>>>>>> |          4|     ghi|      xyz|
>>>>>> +-----------+--------+---------+
>>>>>>
>>>>>> >>> data_df.registerTempTable('temp_table')
>>>>>> >>> sqlContext.sql('insert into table default.test_orc select * from
>>>>>> temp_table')
>>>>>>
>>>>>> *** files in hdfs path in warehouse for the created table ***
>>>>>> [inline screenshot omitted]
>>>>>>
>>>>>> hive> select count(*) from test_orc; (Does not launch map-reduce job)
>>>>>> OK
>>>>>> 2
>>>>>> hive> select distinct(name) from test_orc; (Launches map-reduce job)
>>>>>> abc
>>>>>> def
>>>>>> ghi
>>>>>> pqr
>>>>>>
>>>>>> hive> create table test_orc_new like test_orc stored as ORC;
>>>>>> hive> insert into table test_orc_new select * from test_orc;
>>>>>> hive> select count(*) from test_orc_new;
>>>>>> OK
>>>>>> 4
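>>>>>> (Hive gathers statistics during its own INSERT ... SELECT, which is
>>>>>> presumably why the new table's count comes back correct.)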
>>>>>> ==================================================================
>>>>>>
>>>>>> Even if I restart the Hive services I cannot get the proper count
>>>>>> output from Hive. This problem only occurs if the table is written to using
>>>>>> both Hive and Spark. If only Spark is used to insert records into the table
>>>>>> multiple times, the count query in the Hive terminal works perfectly fine.
>>>>>>
>>>>>> This problem occurs for tables stored with other storage formats as
>>>>>> well (TEXTFILE, etc.).
>>>>>>
>>>>>> Is this because of the different naming conventions used by Hive and
>>>>>> Spark to write records to HDFS? Or is it simply not a recommended practice
>>>>>> to write to tables using different services?
>>>>>>
>>>>>> Your thoughts and comments on this matter would be highly appreciated!
>>>>>>
>>>>>> Thanks!
>>>>>> Nitin
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>