Posted to user@hive.apache.org by "@Sanjiv Singh" <sa...@gmail.com> on 2016/03/12 09:24:23 UTC

Spark SQL is not returning records for HIVE transactional tables on HDP

Hi All,

I am facing an issue on an HDP setup where a COMPACTION is required once
before Spark SQL can fetch records from transactional tables.
On a plain Apache setup, on the other hand, no compaction is required at all.

Maybe something gets triggered in the metastore after compaction, after
which Spark SQL starts recognizing the delta files.

Let me know if you need any other details to get to the root cause.

Try this:

*See the complete scenario:*

hive> create table default.foo(id int) clustered by (id) into 2 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
because data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count // Gives 2, no compaction required.
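
A quick way to confirm what Spark is (not) seeing before and after the
compaction is to list the table's directory on HDFS. A minimal spark-shell
sketch, assuming the default HDP warehouse location for this table (check
the Location line of DESCRIBE FORMATTED default.foo for the real path):

scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> val tableDir = new Path("/apps/hive/warehouse/foo")  // assumed location
scala> fs.listStatus(tableDir).foreach(s => println(s.getPath.getName))

Before the compaction you would expect to see only delta_* subdirectories
here; after a successful major compaction a base_* directory should appear
alongside (or instead of) them.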




Regards
Sanjiv Singh
Mob :  +091 9990-447-339

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by Mich Talebzadeh <mi...@gmail.com>.
This is an interesting one, as it appears that a Hive transactional table
cannot be read from Spark until it has been compacted. My environment:


   1. Hive version 2
   2. Hive on Spark engine 1.3.1
   3. Spark 1.5.2


hive> create table default.foo(id int) clustered by (id) into 2 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

hive> select * from foo;
OK
10
Time taken: 0.067 seconds, Fetched: 1 row(s)
hive> select * from foo;
10

At this stage, if you do a simple select on foo from Spark, you will get an
error that sounds like a bug:


spark-sql> select * from foo;
16/03/12 17:08:21 ERROR SparkSQLDriver: Failed in [select * from foo]
java.lang.RuntimeException: serious problem


No locks are held in Hive on that table. Let us go back and do a compaction
in Hive:

hive> alter table foo compact 'major';
Compaction enqueued.


These messages appear in the Hive log. The compaction job is a MapReduce job:

2016-03-12T17:12:29,776 INFO  [rhes564-31]: mapreduce.Job
(Job.java:monitorAndPrintJob(1345)) - Running job: job_1457790020440_0006
2016-03-12T17:12:31,915 INFO
[org.apache.hadoop.hive.ql.txn.compactor.HouseKeeperServiceBase$1-0]:
txn.AcidHouseKeeperService (AcidHouseKeeperService.java:run(67)) - timeout
reaper ran for 0seconds.  isAliveCounter=-2147483542
2016-03-12T17:13:51,918 INFO
[org.apache.hadoop.hive.ql.txn.compactor.HouseKeeperServiceBase$1-0]:
txn.AcidCompactionHistoryService
(AcidCompactionHistoryService.java:run(76)) - History reaper reaper ran for
0seconds.  isAliveCounter=-2147483488

And the Initiator goes through every single table to check whether it should
be compacted, including temp tables:

2016-03-12T17:15:52,440 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact default.foo
2016-03-12T17:15:52,449 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact
oraclehadoop.sales3
2016-03-12T17:15:52,468 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact
oraclehadoop.smallsales
2016-03-12T17:15:52,480 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact test.stg_t2
2016-03-12T17:15:52,491 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact
default.values__tmp__table__3
2016-03-12T17:15:52,492 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(94)) - Can't find table default.values__tmp__table__3,
assuming it's a temp table or has been dropped and moving on.
2016-03-12T17:15:52,492 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact
default.values__tmp__table__4
2016-03-12T17:15:52,492 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(94)) - Can't find table default.values__tmp__table__4,
assuming it's a temp table or has been dropped and moving on.
2016-03-12T17:15:52,493 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact
default.values__tmp__table__1
2016-03-12T17:15:52,493 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(94)) - Can't find table default.values__tmp__table__1,
assuming it's a temp table or has been dropped and moving on.
2016-03-12T17:15:52,493 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact test.t2
2016-03-12T17:15:52,504 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(89)) - Checking to see if we should compact
default.values__tmp__table__2
2016-03-12T17:15:52,505 INFO  [Thread-9]: compactor.Initiator
(Initiator.java:run(94)) - Can't find table default.values__tmp__table__2,
assuming it's a temp table or has been dropped and moving on.

OK, once the compaction (which Hive does in the background) is complete, one
can query the table from Spark:

spark-sql> select * from foo;
10
Time taken: 4.509 seconds, Fetched 1 row(s)

I notice that if you insert a new row into foo (from Hive), you still get
the same error in Spark

scala> HiveContext.sql("select * from foo").collect.foreach(println)
java.lang.RuntimeException: serious problem

This looks like a bug, as it seems the table only becomes queryable from
Spark after a compaction is done, either interactively or by Hive itself!
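
Until that is fixed, a pragmatic workaround sketch (assuming a manual
compaction is acceptable in your environment): trigger the major compaction
from Hive, poll SHOW COMPACTIONS until the entry for the table reaches a
terminal state (the exact state names vary by Hive version), and only then
query it from Spark:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';
hive> SHOW COMPACTIONS;
-- repeat SHOW COMPACTIONS until the default.foo entry is no longer
-- 'initiated' or 'working'
spark-sql> select * from foo;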

HTH




Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 12 March 2016 at 08:24, @Sanjiv Singh <sa...@gmail.com> wrote:

> Hi All,
>
> I am facing this issue on HDP setup on which COMPACTION is required only
> once for transactional tables to fetch records with Spark SQL.
> On the other hand, Apache setup doesn't required compaction even once.
>
> May be something got triggered on meta-store after compaction, Spark SQL
> start recognizing delta files.
>
> Let know me if needed other details to get root cause.
>
> Try this,
>
> *See complete scenario :*
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> hive> insert into foo values(20);
>
> scala> sqlContext.table("default.foo").count* // Gives 2 , no compaction
> required.*
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> We are using for Spark SQL :

Does SparkSQL support Transactional tables?

I thought transactional tables needed a deadlock-proof LockManager, which is
a Hive-specific feature?
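
For reference, a minimal hive-site.xml sketch of the properties usually
required on the Hive side for ACID tables and the transactional lock manager
(DbTxnManager) referred to above; the values shown are the commonly
documented ones, not taken from the poster's cluster:

<property><name>hive.support.concurrency</name><value>true</value></property>
<property><name>hive.txn.manager</name><value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value></property>
<property><name>hive.compactor.initiator.on</name><value>true</value></property>
<property><name>hive.compactor.worker.threads</name><value>1</value></property>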

Cheers,
Gopal



Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by "@Sanjiv Singh" <sa...@gmail.com>.
Hi All,
We are using for Spark SQL:

   - Hive: 1.2.1
   - Spark: 1.3.1
   - Hadoop: 2.7.1

Let me know if you need other details to debug the issue.


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Sun, Mar 13, 2016 at 1:07 AM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> Thanks for the input. I use Hive 2 and still have this issue.
>
>
>
>    1. Hive version 2
>    2. Hive on Spark engine 1.3.1
>    3. Spark 1.5.2
>
>
> I have added Hive user group  to this as well. So hopefully we may get
> some resolution.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 March 2016 at 19:25, Timur Shenkao <ts...@timshenkao.su> wrote:
>
>> Hi,
>>
>> I have suffered from Hive Streaming , Transactions enough, so I can share
>> my experience with you.
>>
>> 1) It's not a problem of Spark. It happens because of "peculiarities" /
>> bugs of Hive Streaming.  Hive Streaming, transactions are very raw
>> technologies. If you look at Hive JIRA, you'll see several critical bugs
>> concerning Hive Streaming, transactions. Some of them are resolved in Hive
>> 2+ only. But Cloudera & Hortonworks ship their distributions with outdated
>> & buggy Hive.
>> So use Hive 2+. Earlier versions of Hive didn't run compaction at all.
>>
>> 2) In Hive 1.1, I  issue the following lines
>> ALTER TABLE default.foo COMPACT 'MAJOR';
>> SHOW COMPACTIONS;
>>
>> My manual compaction was shown but it was never fulfilled.
>>
>> 3) If you use Hive Streaming, it's not recommended or even forbidden to
>> insert rows into Hive Streaming tables manually. Only the process that
>> writes to such table should insert incoming rows sequentially. Otherwise
>> you'll get unpredictable behaviour.
>>
>> 4) Ordinary Hive tables are catalogs with text, ORC, etc. files.
>> Hive Streaming / transactional tables are catalogs that have numerous
>> subcatalogs with "delta" prefix. Moreover, there are files with
>> "flush_length" suffix in some delta subfolders. "flush_length" files have 8
>> bytes length. The presence of "flush_length" file in some subfolder means
>> that Hive writes updates to this subfolder right now. When Hive fails or is
>> restarted, it begins to write into new delta subfolder with new
>> "flush_length" file. And old "flush_length" file (that was used before
>> failure) still remains.
>> One of the goal of compaction is to delete outdated "flush_length" files.
>> Not every application / library can read such folder structure or knows
>> details of Hive Streaming / transactions implementation. Most of the
>> software solutions still expect ordinary Hive tables as input.
>> When they encounter subcatalogs or special files "flush_length" file,
>> applications / libraries either "see nothing" (return 0 or empty result
>> set) or stumble over "flush_length" files (return unexplainable errors).
>>
>> For instance, Facebook Presto couldn't read subfolders by default unless
>> you activate special parameters. But it stumbles over "flush_length" files
>> as Presto expect legal ORC files not 8-byte-length text files in folders.
>>
>> So, I don't advise you to use Hive Streaming, transactions right now in
>> real production systems (24 / 7 /365) with hundreds millions of events a
>> day.
>>
>> On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh <sa...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am facing this issue on HDP setup on which COMPACTION is required only
>>> once for transactional tables to fetch records with Spark SQL.
>>> On the other hand, Apache setup doesn't required compaction even once.
>>>
>>> May be something got triggered on meta-store after compaction, Spark SQL
>>> start recognizing delta files.
>>>
>>> Let know me if needed other details to get root cause.
>>>
>>> Try this,
>>>
>>> *See complete scenario :*
>>>
>>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>>> hive> insert into default.foo values(10);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>>> because data is still in delta files
>>>
>>> Now run major compaction:
>>>
>>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 1
>>>
>>> hive> insert into foo values(20);
>>>
>>> scala> sqlContext.table("default.foo").count* // Gives 2 , no
>>> compaction required.*
>>>
>>>
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>
>>
>

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

Thanks for the input. I use Hive 2 and still have this issue.



   1. Hive version 2
   2. Hive on Spark engine 1.3.1
   3. Spark 1.5.2


I have added the Hive user group to this as well, so hopefully we may get
some resolution.

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 12 March 2016 at 19:25, Timur Shenkao <ts...@timshenkao.su> wrote:

> Hi,
>
> I have suffered from Hive Streaming , Transactions enough, so I can share
> my experience with you.
>
> 1) It's not a problem of Spark. It happens because of "peculiarities" /
> bugs of Hive Streaming.  Hive Streaming, transactions are very raw
> technologies. If you look at Hive JIRA, you'll see several critical bugs
> concerning Hive Streaming, transactions. Some of them are resolved in Hive
> 2+ only. But Cloudera & Hortonworks ship their distributions with outdated
> & buggy Hive.
> So use Hive 2+. Earlier versions of Hive didn't run compaction at all.
>
> 2) In Hive 1.1, I  issue the following lines
> ALTER TABLE default.foo COMPACT 'MAJOR';
> SHOW COMPACTIONS;
>
> My manual compaction was shown but it was never fulfilled.
>
> 3) If you use Hive Streaming, it's not recommended or even forbidden to
> insert rows into Hive Streaming tables manually. Only the process that
> writes to such table should insert incoming rows sequentially. Otherwise
> you'll get unpredictable behaviour.
>
> 4) Ordinary Hive tables are catalogs with text, ORC, etc. files.
> Hive Streaming / transactional tables are catalogs that have numerous
> subcatalogs with "delta" prefix. Moreover, there are files with
> "flush_length" suffix in some delta subfolders. "flush_length" files have 8
> bytes length. The presence of "flush_length" file in some subfolder means
> that Hive writes updates to this subfolder right now. When Hive fails or is
> restarted, it begins to write into new delta subfolder with new
> "flush_length" file. And old "flush_length" file (that was used before
> failure) still remains.
> One of the goal of compaction is to delete outdated "flush_length" files.
> Not every application / library can read such folder structure or knows
> details of Hive Streaming / transactions implementation. Most of the
> software solutions still expect ordinary Hive tables as input.
> When they encounter subcatalogs or special files "flush_length" file,
> applications / libraries either "see nothing" (return 0 or empty result
> set) or stumble over "flush_length" files (return unexplainable errors).
>
> For instance, Facebook Presto couldn't read subfolders by default unless
> you activate special parameters. But it stumbles over "flush_length" files
> as Presto expect legal ORC files not 8-byte-length text files in folders.
>
> So, I don't advise you to use Hive Streaming, transactions right now in
> real production systems (24 / 7 /365) with hundreds millions of events a
> day.
>
> On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh <sa...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I am facing this issue on HDP setup on which COMPACTION is required only
>> once for transactional tables to fetch records with Spark SQL.
>> On the other hand, Apache setup doesn't required compaction even once.
>>
>> May be something got triggered on meta-store after compaction, Spark SQL
>> start recognizing delta files.
>>
>> Let know me if needed other details to get root cause.
>>
>> Try this,
>>
>> *See complete scenario :*
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into default.foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>> because data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>> hive> insert into foo values(20);
>>
>> scala> sqlContext.table("default.foo").count* // Gives 2 , no compaction
>> required.*
>>
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>
>

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by Timur Shenkao <ts...@timshenkao.su>.
Hi,

I have suffered from Hive Streaming and transactions enough, so I can share
my experience with you.

1) It's not a problem with Spark. It happens because of "peculiarities" /
bugs in Hive Streaming. Hive Streaming and transactions are very raw
technologies. If you look at the Hive JIRA, you'll see several critical bugs
concerning Hive Streaming and transactions. Some of them are resolved only in
Hive 2+, yet Cloudera & Hortonworks ship their distributions with an outdated
& buggy Hive.
So use Hive 2+. Earlier versions of Hive didn't run compaction at all.

2) In Hive 1.1, I issued the following lines:
ALTER TABLE default.foo COMPACT 'MAJOR';
SHOW COMPACTIONS;

My manual compaction request was shown, but it was never carried out.

3) If you use Hive Streaming, it's not recommended, or even forbidden, to
insert rows into Hive Streaming tables manually. Only the process that
writes to such a table should insert incoming rows, sequentially. Otherwise
you'll get unpredictable behaviour.

4) Ordinary Hive tables are directories with text, ORC, etc. files.
Hive Streaming / transactional tables are directories that contain numerous
subdirectories with a "delta" prefix. Moreover, there are files with a
"flush_length" suffix in some delta subdirectories. "flush_length" files are
8 bytes long. The presence of a "flush_length" file in a subdirectory means
that Hive is writing updates to that subdirectory right now. When Hive fails
or is restarted, it begins to write into a new delta subdirectory with a new
"flush_length" file, and the old "flush_length" file (the one in use before
the failure) still remains.
One of the goals of compaction is to delete outdated "flush_length" files.
Not every application / library can read such a directory structure or knows
the details of the Hive Streaming / transactions implementation. Most
software solutions still expect ordinary Hive tables as input.
When they encounter the subdirectories or the special "flush_length" files,
applications / libraries either "see nothing" (return 0 or an empty result
set) or stumble over the "flush_length" files (return unexplainable errors).
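
To illustrate that point, here is a minimal, hypothetical sketch of the
filtering a non-ACID-aware reader would need in order to cope with such a
layout (the helper name is made up; only the plain Hadoop FileSystem API is
assumed):

import org.apache.hadoop.fs.{FileSystem, Path}

// Keep only the files a plain (non-ACID-aware) ORC reader can open, skipping
// the delta_* subdirectories and the small "flush_length" side files.
def listPlainOrcFiles(fs: FileSystem, tableDir: Path): Seq[Path] =
  fs.listStatus(tableDir).toSeq
    .filterNot(s => s.isDirectory && s.getPath.getName.startsWith("delta_"))
    .filterNot(s => s.getPath.getName.endsWith("flush_length"))
    .map(_.getPath)

A reader without this kind of awareness either returns nothing (all the data
is still under delta_*) or fails on the side files, which matches the
zero-count and "serious problem" symptoms earlier in this thread.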

For instance, Facebook Presto can't read subdirectories by default unless
you activate special parameters, and even then it stumbles over
"flush_length" files, as Presto expects valid ORC files, not 8-byte side
files, in those directories.

So, I don't advise using Hive Streaming and transactions right now in real
production systems (24/7/365) with hundreds of millions of events a day.

On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh <sa...@gmail.com>
wrote:

> Hi All,
>
> I am facing this issue on HDP setup on which COMPACTION is required only
> once for transactional tables to fetch records with Spark SQL.
> On the other hand, Apache setup doesn't required compaction even once.
>
> May be something got triggered on meta-store after compaction, Spark SQL
> start recognizing delta files.
>
> Let know me if needed other details to get root cause.
>
> Try this,
>
> *See complete scenario :*
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because data is still in delta files
>
> Now run major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> hive> insert into foo values(20);
>
> scala> sqlContext.table("default.foo").count* // Gives 2 , no compaction
> required.*
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks for the update, Alan.

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 March 2016 at 17:49, Alan Gates <al...@gmail.com> wrote:

>
> > On Mar 14, 2016, at 10:31, Mich Talebzadeh <mi...@gmail.com>
> wrote:
> >
> > That is an interesting point Alan.
> >
> > Does this imply that Hive on Spark (Hive 2 encourages Spark or TEZ)  is
> going to have an issue with transactional tables?
> I was partially wrong.  In HIVE-12366 Wei refactored this so that the
> heartbeats are sent from the client rather than from the execution engine.
> This JIRA has been committed to master and branch-1, but after 1.2.  So
> Hive on Spark should work fine with transactional tables in Hive 2.x.  In
> 1.2 and earlier it will not.
>
> Alan.
>
>
>

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by Alan Gates <al...@gmail.com>.
> On Mar 14, 2016, at 10:31, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> That is an interesting point Alan.
> 
> Does this imply that Hive on Spark (Hive 2 encourages Spark or TEZ)  is going to have an issue with transactional tables?
I was partially wrong.  In HIVE-12366 Wei refactored this so that the heartbeats are sent from the client rather than from the execution engine.  This JIRA has been committed to master and branch-1, but after 1.2.  So Hive on Spark should work fine with transactional tables in Hive 2.x.  In 1.2 and earlier it will not.

Alan.



Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by Mich Talebzadeh <mi...@gmail.com>.
That is an interesting point Alan.

Does this imply that Hive on Spark (Hive 2 encourages Spark or Tez) is
going to have an issue with transactional tables?

Besides, this begs the question of why we still run Hive on the Spark 1.3.1
engine; we don't seem to be able to make Hive work on a newer version of
the Spark engine.

Cheers,

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 March 2016 at 16:48, Alan Gates <al...@gmail.com> wrote:

> I don’t know why you’re seeing Hive on Spark sometimes work with
> transactional tables and sometimes not.  But note that in general it
> doesn’t work.  The Spark runtime in Hive does not send heartbeats to the
> transaction/lock manager so it will timeout any job that takes longer than
> the heartbeat interval (5 min by default).
>
> Alan.
>
> > On Mar 12, 2016, at 00:24, @Sanjiv Singh <sa...@gmail.com> wrote:
> >
> > Hi All,
> >
> > I am facing this issue on HDP setup on which COMPACTION is required only
> once for transactional tables to fetch records with Spark SQL.
> > On the other hand, Apache setup doesn't required compaction even once.
> >
> > May be something got triggered on meta-store after compaction, Spark SQL
> start recognizing delta files.
> >
> > Let know me if needed other details to get root cause.
> >
> > Try this,
> >
> > See complete scenario :
> >
> > hive> create table default.foo(id int) clustered by (id) into 2 buckets
> STORED AS ORC TBLPROPERTIES ('transactional'='true');
> > hive> insert into default.foo values(10);
> >
> > scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
> because data is still in delta files
> >
> > Now run major compaction:
> >
> > hive> ALTER TABLE default.foo COMPACT 'MAJOR';
> >
> > scala> sqlContext.table("default.foo").count // Gives 1
> >
> > hive> insert into foo values(20);
> >
> > scala> sqlContext.table("default.foo").count // Gives 2 , no compaction
> required.
> >
> >
> >
> >
> > Regards
> > Sanjiv Singh
> > Mob :  +091 9990-447-339
>
>

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Posted by Alan Gates <al...@gmail.com>.
I don’t know why you’re seeing Hive on Spark sometimes work with transactional tables and sometimes not.  But note that in general it doesn’t work.  The Spark runtime in Hive does not send heartbeats to the transaction/lock manager, so it will time out any job that takes longer than the heartbeat interval (5 min by default).
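
For reference, the interval Alan mentions is governed by the
hive.txn.timeout property (300 seconds, i.e. the 5 minutes above, by
default), set in hive-site.xml on the metastore side. A minimal sketch of
raising it; note this is only a stopgap, since the underlying problem is the
missing heartbeats:

<property>
  <name>hive.txn.timeout</name>
  <value>600</value>  <!-- seconds; the default is 300 -->
</property>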

Alan.

> On Mar 12, 2016, at 00:24, @Sanjiv Singh <sa...@gmail.com> wrote:
> 
> Hi All,
> 
> I am facing this issue on HDP setup on which COMPACTION is required only once for transactional tables to fetch records with Spark SQL.
> On the other hand, Apache setup doesn't required compaction even once.
> 
> May be something got triggered on meta-store after compaction, Spark SQL start recognizing delta files.
>   
> Let know me if needed other details to get root cause.
> 
> Try this,
> 
> See complete scenario :
> 
> hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into default.foo values(10);
> 
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because data is still in delta files
> 
> Now run major compaction:
> 
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
> 
> scala> sqlContext.table("default.foo").count // Gives 1
> 
> hive> insert into foo values(20);
> 
> scala> sqlContext.table("default.foo").count // Gives 2 , no compaction required.
> 
> 
> 
> 
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339