Posted to user@hive.apache.org by "Grant Overby (groverby)" <gr...@cisco.com> on 2015/04/14 20:46:27 UTC

External Table with unclosed orc files.

What will Hive do if querying an external table containing orc files that are still being written to?

If the process writing the orc files exits without calling .close()?


Sorry for taking the cheap way out and asking instead of testing. I couldn’t find anything on this via google. I won’t be able to test these scenarios till tomorrow and would like to have some idea of what to expect this afternoon.

Re: External Table with unclosed orc files.

Posted by Alan Gates <al...@gmail.com>.
It will fail.  ORC writes information in the footer that is required to
properly read the file.  If close hasn't been called, then that footer
hasn't been written yet.
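
For reference, a minimal writer sketch against the org.apache.orc core
API (Hive 0.13/0.14 ship an older variant of this writer under
org.apache.hadoop.hive.ql.io.orc); the path and schema here are invented
for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcCloseDemo {
  public static void main(String[] args) throws Exception {
    TypeDescription schema = TypeDescription.fromString("struct<id:bigint>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/demo.orc"),
        OrcFile.writerOptions(new Configuration()).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector id = (LongColumnVector) batch.cols[0];
    for (long i = 0; i < 10; i++) {
      id.vector[batch.size++] = i;
    }
    writer.addRowBatch(batch);
    // close() flushes the final stripe and writes the footer and
    // postscript. If the process dies before this call, readers have no
    // footer to locate stripes with, and the file cannot be read.
    writer.close();
  }
}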

Alan.

> Grant Overby (groverby) <ma...@cisco.com>
> April 14, 2015 at 20:46
> What will Hive do if querying an external table containing orc files 
> that are still being written to?
>
> If the process writing the orc files exits without calling .close()?
>
>
> Sorry for taking the cheap way out and asking instead of testing. I 
> couldn’t find anything on this via google. I won’t be able to test 
> these scenarios till tomorrow and would like to have some idea of what 
> to expect this afternoon.

Re: External Table with unclosed orc files.

Posted by "Grant Overby (groverby)" <gr...@cisco.com>.
The remainder of my ranting paragraph is intended as an expansion on that
comment. Sorry, I wasn’t clear.


Grant Overby
Software Engineer
Cisco.com <http://www.cisco.com/>
groverby@cisco.com
Mobile: 865 724 4910




On 4/14/15, 5:09 PM, "Mich Talebzadeh" <mi...@peridale.co.uk> wrote:

>Hi Grant,
>
>Thanks for insight.
>
>You mentioned and I quote
>
>" Acid tables have been a real pain for us. We don’t believe they are
>production ready. "
>
>Can you please elaborate on this?
>
>Thanks
>
>Mich Talebzadeh
>
>http://talebzadehmich.wordpress.com
>
>
>
>-----Original Message-----
>From: Grant Overby (groverby) [mailto:groverby@cisco.com]
>Sent: 14 April 2015 22:02
>To: Gopal Vijayaraghavan; user@hive.apache.org
>Subject: Re: External Table with unclosed orc files.
>
>Thanks for the link to the hive streaming bolt. We rolled our own bolt
>many
>moons ago to utilize hive streaming. We’ve tried it against 0.13 and
>0.14. Acid tables have been a real pain for us. We don’t believe they are
>production ready. At least in our use cases, Tez crashes for assorted
>reasons or only assigns 1 mapper to the partition. Having delta files and
>no
>base files borks mapper assignments. Files containing flush in their names
>are left scattered about, borking queries. Latency is higher with
>streaming
>than writing to an orc file in hdfs, forcing obscene quantities of buckets
>and orc files smaller than any reasonable orc stripe / hdfs block size.
>The
>compactor hangs seemingly at random for no reason we’ve been able to
>discern.
>
>
>
>An orc file without a footer is junk data (or, at least, the last stripe
>is
>junk data). I suppose my question should have been 'what will the hive
>query
>do when it encounters this? Skip the stripe / file? Error out the query?
>Something else?’
>
>
>
>
>Grant Overby
>Software Engineer
>Cisco.com <http://www.cisco.com/>
>groverby@cisco.com
>Mobile: 865 724 4910
>
>
>
>
>
>
>
>
>
>
>
>On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:
>
>>
>>> What will Hive do if querying an external table containing orc files
>>>that are still being written to?
>>
>>Doing that directly won’t work at all, because ORC files are only
>>readable after the footer has been written out, which won’t have
>>happened for any open files.
>>
>>> I won’t be able to test these scenarios till tomorrow and would like to
>>>have some idea of what to expect this afternoon.
>>
>>If I remember correctly, your previous question was about writing ORC
>>from Storm.
>>
>>If you’re on a recent version of Storm, I’d advise you to look at
>>storm-hive/
>>
>>https://github.com/apache/storm/tree/master/external/storm-hive
>>
>>
>>Or alternatively, there’s a “hortonworks trucking demo” which does a
>>partition insert instead.
>>
>>Cheers,
>>Gopal
>>
>>
>
>


RE: External Table with unclosed orc files.

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Grant,

Thanks for insight.

You mentioned and I quote

" Acid tables have been a real pain for us. We don’t believe they are
production ready. "

Can you please elaborate on this?

Thanks

Mich Talebzadeh

http://talebzadehmich.wordpress.com



-----Original Message-----
From: Grant Overby (groverby) [mailto:groverby@cisco.com] 
Sent: 14 April 2015 22:02
To: Gopal Vijayaraghavan; user@hive.apache.org
Subject: Re: External Table with unclosed orc files.

Thanks for the link to the hive streaming bolt. We rolled our own bolt many
moons ago to utilize hive streaming. We’ve tried it against 0.13 and
0.14. Acid tables have been a real pain for us. We don’t believe they are
production ready. At least in our use cases, Tez crashes for assorted
reasons or only assigns 1 mapper to the partition. Having delta files and no
base files borks mapper assignments. Files containing flush in their names
are left scattered about, borking queries. Latency is higher with streaming
than writing to an orc file in hdfs, forcing obscene quantities of buckets
and orc files smaller than any reasonable orc stripe / hdfs block size. The
compactor hangs seemingly at random for no reason we’ve been able to
discern.



An orc file without a footer is junk data (or, at least, the last stripe is
junk data). I suppose my question should have been 'what will the hive query
do when it encounters this? Skip the stripe / file? Error out the query?
Something else?’




Grant Overby
Software Engineer
Cisco.com <http://www.cisco.com/>
groverby@cisco.com
Mobile: 865 724 4910











On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:

>
>> What will Hive do if querying an external table containing orc files
>>that are still being written to?
>
>Doing that directly won’t work at all, because ORC files are only
>readable after the footer has been written out, which won’t have
>happened for any open files.
>
>> I won’t be able to test these scenarios till tomorrow and would like to
>>have some idea of what to expect this afternoon.
>
>If I remember correctly, your previous question was about writing ORC
>from Storm.
>
>If you’re on a recent version of Storm, I’d advise you to look at
>storm-hive/
>
>https://github.com/apache/storm/tree/master/external/storm-hive
>
>
>Or alternatively, there’s a “hortonworks trucking demo” which does a
>partition insert instead.
>
>Cheers,
>Gopal
>
>



Re: External Table with unclosed orc files.

Posted by Alan Gates <ga...@apache.org>.
So it was in the map reduce job itself? Then the best information would
be the logs from the MR job, so we can see what it's doing (or perhaps
not doing).

Alan.

Grant Overby (groverby) wrote:
> It wasn’t reliably reproducible for us. If we killed the compaction
> job in yarn and manually triggered compaction for the same partition,
> it would succeed. We would see this about 1 time every 2 days / 200
> partitions. There weren’t any errors logged that we noticed. The job
> was simply sitting there making no progress.
>
> I’m not using acid tables currently, but I’ll likely give it another
> go. What information should I capture to help with this issue?
>
>
>
>
>
> From: Alan Gates <gates@apache.org>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>,
> "gates@apache.org" <gates@apache.org>
> Date: Wednesday, April 15, 2015 at 4:07 AM
> To: "user@hive.apache.org" <user@hive.apache.org>
> Subject: Re: External Table with unclosed orc files.
>
>
>
> Grant Overby (groverby) wrote:
>> Thanks for the link to the hive streaming bolt. We rolled our own bolt
>> many moons ago to utilize hive streaming. We’ve tried it against 0.13 and
>> 0.14. Acid tables have been a real pain for us. We don’t believe they are
>> production ready. At least in our use cases, Tez crashes for assorted
>> reasons or only assigns 1 mapper to the partition. Having delta files and
>> no base files borks mapper assignments. Files containing flush in their
>> names are left scattered about, borking queries. Latency is higher with
>> streaming than writing to an orc file in hdfs, forcing obscene quantities
>> of buckets and orc files smaller than any reasonable orc stripe / hdfs
>> block size. The compactor hangs seemingly at random for no reason we’ve
>> been able to discern.
> The issues with flush files borking queries have been resolved in Hive
> 1.0. I haven't seen any issues with the compactor hanging at random.
> Could you expand on what parts hung? If you have a reproducible case
> it would be great to file a JIRA so we can fix it.
>
> Alan.
>> An orc file without a footer is junk data (or, at least, the last stripe
>> is junk data). I suppose my question should have been 'what will the hive
>> query do when it encounters this? Skip the stripe / file? Error out the
>> query? Something else?’
>>
>>
>>
>>
>> Grant Overby
>> Software Engineer
>> Cisco.com <http://www.cisco.com/>
>> groverby@cisco.com
>> Mobile: 865 724 4910
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:
>>
>>>> What will Hive do if querying an external table containing orc files
>>>> that are still being written to?
>>> Doing that directly won’t work at all, because ORC files are only
>>> readable after the footer has been written out, which won’t have
>>> happened for any open files.
>>>
>>>> I won’t be able to test these scenarios till tomorrow and would like to
>>>> have some idea of what to expect this afternoon.
>>> If I remember correctly, your previous question was about writing ORC
>>> from Storm.
>>>
>>> If you’re on a recent version of Storm, I’d advise you to look at
>>> storm-hive/
>>>
>>> https://github.com/apache/storm/tree/master/external/storm-hive
>>>
>>>
>>> Or alternatively, there’s a “hortonworks trucking demo” which does a
>>> partition insert instead.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>

Re: External Table with unclosed orc files.

Posted by "Grant Overby (groverby)" <gr...@cisco.com>.
It wasn’t reliably reproducible for us. If we killed the compaction job in yarn and manually triggered compaction for the same partition, it would succeed. We would see this about 1 time every 2 days / 200 partitions. There weren’t any errors logged that we noticed. The job was simply sitting there making no progress.

I’m not using acid tables currently, but I’ll likely give it another go. What information should I capture to help with this issue?





From: Alan Gates <ga...@apache.org>
Reply-To: "user@hive.apache.org" <us...@hive.apache.org>, "gates@apache.org" <ga...@apache.org>
Date: Wednesday, April 15, 2015 at 4:07 AM
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Re: External Table with unclosed orc files.



Grant Overby (groverby) wrote:

Thanks for the link to the hive streaming bolt. We rolled our own bolt
many moons ago to utilize hive streaming. We’ve tried it against 0.13 and
0.14. Acid tables have been a real pain for us. We don’t believe they are
production ready. At least in our use cases, Tez crashes for assorted
reasons or only assigns 1 mapper to the partition. Having delta files and
no base files borks mapper assignments. Files containing flush in their
names are left scattered about, borking queries. Latency is higher with
streaming than writing to an orc file in hdfs, forcing obscene quantities
of buckets and orc files smaller than any reasonable orc stripe / hdfs
block size. The compactor hangs seemingly at random for no reason we’ve
been able to discern.

The issues with flush files borking queries have been resolved in Hive 1.0. I haven't seen any issues with the compactor hanging at random. Could you expand on what parts hung? If you have a reproducible case it would be great to file a JIRA so we can fix it.

Alan.


An orc file without a footer is junk data (or, at least, the last stripe
is junk data). I suppose my question should have been 'what will the hive
query do when it encounters this? Skip the stripe / file? Error out the
query? Something else?’




Grant Overby
Software Engineer
Cisco.com <http://www.cisco.com/>
groverby@cisco.com
Mobile: 865 724 4910











On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:



> What will Hive do if querying an external table containing orc files
> that are still being written to?

Doing that directly won’t work at all, because ORC files are only
readable after the footer has been written out, which won’t have
happened for any open files.

> I won’t be able to test these scenarios till tomorrow and would like to
> have some idea of what to expect this afternoon.

If I remember correctly, your previous question was about writing ORC
from Storm.

If you’re on a recent version of Storm, I’d advise you to look at
storm-hive/

https://github.com/apache/storm/tree/master/external/storm-hive


Or alternatively, there’s a “hortonworks trucking demo” which does a
partition insert instead.

Cheers,
Gopal




Re: External Table with unclosed orc files.

Posted by Alan Gates <ga...@apache.org>.

Grant Overby (groverby) wrote:
> Thanks for the link to the hive streaming bolt. We rolled our own bolt
> many moons ago to utilize hive streaming. We’ve tried it against 0.13 and
> 0.14. Acid tables have been a real pain for us. We don’t believe they are
> production ready. At least in our use cases, Tez crashes for assorted
> reasons or only assigns 1 mapper to the partition. Having delta files and
> no base files borks mapper assignments. Files containing flush in their
> names are left scattered about, borking queries. Latency is higher with
> streaming than writing to an orc file in hdfs, forcing obscene quantities
> of buckets and orc files smaller than any reasonable orc stripe / hdfs
> block size. The compactor hangs seemingly at random for no reason we’ve
> been able to discern.
The issues with flush files borking queries have been resolved in Hive
1.0. I haven't seen any issues with the compactor hanging at random.
Could you expand on what parts hung? If you have a reproducible case it
would be great to file a JIRA so we can fix it.

Alan.
>
>
>
> An orc file without a footer is junk data (or, at least, the last stripe
> is junk data). I suppose my question should have been 'what will the hive
> query do when it encounters this? Skip the stripe / file? Error out the
> query? Something else?’
>
>
>
>
> Grant Overby
> Software Engineer
> Cisco.com <http://www.cisco.com/>
> groverby@cisco.com
> Mobile: 865 724 4910
>
>
>
>
>
>
>
>
>
>
>
> On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:
>
>>> What will Hive do if querying an external table containing orc files
>>> that are still being written to?
>> Doing that directly won’t work at all, because ORC files are only
>> readable after the footer has been written out, which won’t have
>> happened for any open files.
>>
>>> I won’t be able to test these scenarios till tomorrow and would like to
>>> have some idea of what to expect this afternoon.
>> If I remember correctly, your previous question was about writing ORC
>> from Storm.
>>
>> If you’re on a recent version of Storm, I’d advise you to look at
>> storm-hive/
>>
>> https://github.com/apache/storm/tree/master/external/storm-hive
>>
>>
>> Or alternatively, there’s a “hortonworks trucking demo” which does a
>> partition insert instead.
>>
>> Cheers,
>> Gopal
>>
>>
>

Re: External Table with unclosed orc files.

Posted by Chad Dotzenrod <cd...@gmail.com>.
unsubscribe

On Tue, Apr 14, 2015 at 4:28 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

>
> >0.14. Acid tables have been a real pain for us. We don’t believe they are
> >production ready. At least in our use cases, Tez crashes for assorted
> >reasons or only assigns 1 mapper to the partition. Having delta files and
> >no base files borks mapper assignments.
>
> Some of the chicken-egg problems for those were solved recently in
> HIVE-10114.
>
> Then TEZ-1993 is coming out in the next version of Tez, into which we’re
> plugging in HIVE-7428 (no fix yet).
>
> Currently delta-only splits have 0 bytes as the “file size”, so they get
> grouped together to make a 16MB chunk (rather than a huge single
> zero-sized split).
>
> Those patches are the effect of me shaving the yak from the “1 mapper”
> issue.
>
> After which the writer has to follow up on HIVE-9933 to get the locality
> of files fixed.
>
> >name are left scattered about, borking queries. Latency is higher with
> >streaming than writing to an orc file in hdfs, forcing obscene quantities
> >of buckets and orc files smaller than any reasonable orc stripe / hdfs
> >block size. The compactor hangs seemingly at random for no reason we’ve
> >been able to discern.
>
> I haven’t seen these issues yet, but I am not dealing with a large volume
> insert rate, so I haven’t produced latency issues there.
>
> Since I work on Hive performance and haven’t seen too many bugs filed,
> I haven’t paid attention to the performance of ACID.
>
> Please file bugs when you find them, so that they appear on the radar for
> folks like me.
>
> I’m poking about because I want a live stream into LLAP to work seamlessly
> & return sub-second query results when queried (pre-cache/stage & merge
> etc).
>
> >An orc file without a footer is junk data (or, at least, the last stripe
> >is junk data). I suppose my question should have been 'what will the hive
> >query do when it encounters this? Skip the stripe / file? Error out the
> >query? Something else?’
>
> It should throw an exception, because that’s a corrupt ORC file.
>
> The trucking demo uses Storm without ACID - this is likely to get better
> once we use Apache Falcon to move the data around.
>
> Cheers,
> Gopal
>
>
>


-- 
Chad J. Dotzenrod
(630)669-6095
cdotzenrod@gmail.com

Re: External Table with unclosed orc files.

Posted by "Grant Overby (groverby)" <gr...@cisco.com>.
IIRC the HW Trucking Demo creates a temporary table from csv files of the
new data then issues a select … insert into an orc table.

For the love of google, I can’t find this demo atm, and I’m out of time.


If I recall correctly, this strikes me as suboptimal compared to writing
orc files directly: data must first be written to disk in a bulky text
format and then copied.
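
Roughly, that pattern over Hive JDBC would look like the sketch below
(host, table, and column names invented; this illustrates the
load-as-text-then-rewrite idea, not the demo’s actual code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvToOrc {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "hive", "");
         Statement st = conn.createStatement()) {
      // External text table over the newly landed CSV files.
      st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS events_staging "
          + "(id BIGINT, msg STRING) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
          + "STORED AS TEXTFILE LOCATION '/landing/events'");
      // Hive rewrites the rows as ORC and closes the files itself, so
      // queries never see a footer-less file.
      st.execute("INSERT INTO TABLE events_orc SELECT id, msg "
          + "FROM events_staging");
    }
  }
}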


I’ll dig deep here as soon as I get a chance.



On 4/14/15, 6:09 PM, "Grant Overby (groverby)" <gr...@cisco.com> wrote:

>Submitting patches or test cases is tricky business for a Cisco employee.
>I’ll put in the legal admin effort to get approval to do this. :/ The
>majority of the issues I mentioned /should/ find their way to apache via
>hortonworks.
>
>
>Additional responses are inline.
>
>
>
>
>
>
>
>
>
>On 4/14/15, 5:28 PM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:
>
>>
>>>0.14. Acid tables have been a real pain for us. We don’t believe they
>>>are production ready. At least in our use cases, Tez crashes for
>>>assorted reasons or only assigns 1 mapper to the partition. Having
>>>delta files and no base files borks mapper assignments.
>>
>>Some of the chicken-egg problems for those were solved recently in
>>HIVE-10114.
>>
>>Then TEZ-1993 is coming out in the next version of Tez, into which we’re
>>plugging in HIVE-7428 (no fix yet).
>>
>>Currently delta-only splits have 0 bytes as the “file size”, so they get
>>grouped together to make a 16MB chunk (rather than a huge single
>>zero-sized split).
>>
>>Those patches are the effect of me shaving the yak from the “1 mapper”
>>issue.
>>
>>After which the writer has to follow up on HIVE-9933 to get the locality
>>of files fixed.
>
>I’ll look into this. If the 1 mapper issue is solved, that would be a huge
>win for streaming for us.
>
>
>>
>>>name are left scattered about, borking queries. Latency is higher with
>>>streaming than writing to an orc file in hdfs, forcing obscene
>>>quantities of buckets and orc files smaller than any reasonable orc
>>>stripe / hdfs block size. The compactor hangs seemingly at random for
>>>no reason we’ve been able to discern.
>>
>>I haven’t seen these issues yet, but I am not dealing with a large volume
>>insert rate, so I haven’t produced latency issues there.
>>
>>Since I work on Hive performance and haven’t seen too many bugs filed,
>>I haven’t paid attention to the performance of ACID.
>>
>>Please file bugs when you find them, so that they appear on the radar
>>for folks like me.
>>
>>I’m poking about because I want a live stream into LLAP to work
>>seamlessly & return sub-second query results when queried
>>(pre-cache/stage & merge etc).
>
>These files aren’t orc, but hive expects them to be, leading to errors.
>They are made by using the hive streaming api.
>root@twig13:~# hdfs dfs -ls -R /apps/hive/warehouse/events.db/connection_events4/ | grep flush | head -n 1
>-rw-r--r-- 3 storm hadoop 200 2015-04-09 17:12 /apps/hive/warehouse/events.db/connection_events4/dt=1428613200/delta_11714703_11714802/bucket_00007_flush_length
>root@twig13:~# hdfs dfs -ls -R /apps/hive/warehouse/events.db/connection_events4/ | grep flush | wc -l
>283
>
>This may be addressed by HIVE-8966, which is in the 1.0.0 release. kill -9
>to the process writing to hive is a near guaranteed way to leave these
>orphaned flush files, but we have seen them on several occasions when
>there is no indication that .close() was skipped.
>
>Our insert rate is about 100k/s for a 4 box cluster. Storm, Kafka, Hdfs,
>Hive, etc are ‘pancaked’ on this cluster. To keep up with this insert rate
>we need somewhere between 64 and 128 buckets for streaming to support an
>equal number of threads. We can keep up this same pace when writing orc
>files directly to hdfs with only 8 threads and thus 8 orc files. The orc
>files from streaming are on the order of 5MB apiece (15min insert-time
>base partitions). Even if orc stripes this small aren’t a problem, it’s
>still going to waste a lot of disk space due to the hdfs block size.
>
>
>>
>>>An orc file without a footer is junk data (or, at least, the last stripe
>>>is junk data). I suppose my question should have been 'what will the
>>>hive query do when it encounters this? Skip the stripe / file? Error out
>>>the query? Something else?’
>>
>>It should throw an exception, because that’s a corrupt ORC file.
>>
>>The trucking demo uses Storm without ACID - this is likely to get better
>>once we use Apache Falcon to move the data around.
>>
>>Cheers,
>>Gopal
>>
>>
>
>I suppose the best thing to do then is to write the orc file outside of
>the partition directory and then issue an mv when the file is closed?
>
>>
>


Re: External Table with unclosed orc files.

Posted by "Grant Overby (groverby)" <gr...@cisco.com>.
Submitting patches or test cases is tricky business for a Cisco employee.
I’ll put in the legal admin effort to get approval to do this. :/ The
majority of the issues I mentioned /should/ find their way to apache via
hortonworks.


Additional responses are inline.









On 4/14/15, 5:28 PM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:

>
>>0.14. Acid tables have been a real pain for us. We don’t believe they
>>are production ready. At least in our use cases, Tez crashes for assorted
>>reasons or only assigns 1 mapper to the partition. Having delta files and
>>no base files borks mapper assignments.
>
>Some of the chicken-egg problems for those were solved recently in
>HIVE-10114.
>
>Then TEZ-1993 is coming out in the next version of Tez, into which we’re
>plugging in HIVE-7428 (no fix yet).
>
>Currently delta-only splits have 0 bytes as the “file size”, so they get
>grouped together to make a 16MB chunk (rather than a huge single
>zero-sized split).
>
>Those patches are the effect of me shaving the yak from the “1 mapper”
>issue.
>
>After which the writer has to follow up on HIVE-9933 to get the locality
>of files fixed.

I’ll look into this. If the 1 mapper issue is solved, that would be a huge
win for streaming for us.


>
>>name are left scattered about, borking queries. Latency is higher with
>>streaming than writing to an orc file in hdfs, forcing obscene quantities
>>of buckets and orc files smaller than any reasonable orc stripe / hdfs
>>block size. The compactor hangs seemingly at random for no reason we’ve
>>been able to discern.
>
>I haven’t seen these issues yet, but I am not dealing with a large volume
>insert rate, so I haven’t produced latency issues there.
>
>Since I work on Hive performance and haven’t seen too many bugs filed,
>I haven’t paid attention to the performance of ACID.
>
>Please file bugs when you find them, so that they appear on the radar for
>folks like me.
>
>I’m poking about because I want a live stream into LLAP to work seamlessly
>& return sub-second query results when queried (pre-cache/stage & merge
>etc).

These files aren’t orc, but hive expects them to be, leading to errors.
They are made by using the hive streaming api.
root@twig13:~# hdfs dfs -ls -R /apps/hive/warehouse/events.db/connection_events4/ | grep flush | head -n 1
-rw-r--r-- 3 storm hadoop 200 2015-04-09 17:12 /apps/hive/warehouse/events.db/connection_events4/dt=1428613200/delta_11714703_11714802/bucket_00007_flush_length
root@twig13:~# hdfs dfs -ls -R /apps/hive/warehouse/events.db/connection_events4/ | grep flush | wc -l
283

This may be addressed by HIVE-8966, which is in the 1.0.0 release. kill -9
to the process writing to hive is a near guaranteed way to leave these
orphaned flush files, but we have seen them on several occasions when
there is no indication that .close() was skipped.
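
For context, our writer follows the usual hive-hcatalog-streaming
pattern, roughly the sketch below (metastore URI, table, partition
value, and columns are invented here). A process that dies before
reaching the close() calls at the bottom, e.g. on kill -9, is exactly
what strands these flush_length side files:

import java.util.Arrays;
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class StreamingSketch {
  public static void main(String[] args) throws Exception {
    // One endpoint per table partition being streamed into.
    HiveEndPoint endPoint = new HiveEndPoint("thrift://metastore:9083",
        "events", "connection_events4", Arrays.asList("1428613200"));
    StreamingConnection conn = endPoint.newConnection(true);
    DelimitedInputWriter writer = new DelimitedInputWriter(
        new String[] {"id", "msg"}, ",", endPoint);
    TransactionBatch batch = conn.fetchTransactionBatch(10, writer);
    batch.beginNextTransaction();
    batch.write("1,hello".getBytes());
    batch.commit();
    batch.close();   // finalizes the delta and its side files
    conn.close();
  }
}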

Our insert rate is about 100k/s for a 4 box cluster. Storm, Kafka, Hdfs,
Hive, etc are ‘pancaked’ on this cluster. To keep up with this insert rate
we need somewhere between 64 and 128 buckets for streaming to support an
equal number of threads. We can keep up this same pace when writing orc
files directly to hdfs with only 8 threads and thus 8 orc files. The orc
files from streaming are on the order of 5MB apiece (15min insert-time
base partitions). Even if orc stripes this small aren’t a problem, it’s
still going to waste a lot of disk space due to the hdfs block size.


>
>>An orc file without a footer is junk data (or, at least, the last stripe
>>is junk data). I suppose my question should have been 'what will the hive
>>query do when it encounters this? Skip the stripe / file? Error out the
>>query? Something else?’
>
>It should throw an exception, because that’s a corrupt ORC file.
>
>The trucking demo uses Storm without ACID - this is likely to get better
>once we use Apache Falcon to move the data around.
>
>Cheers,
>Gopal
>
>

I suppose the best thing to do then is to write the orc file outside of
the partition directory and then issue an mv when the file is closed?
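
Sketched out, that write-then-move approach might look like this (paths
and schema invented); FileSystem.rename is the programmatic mv, and it
is atomic within a single HDFS namespace, so a query should see either
no file or a complete one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class PublishOnClose {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path scratch = new Path("/tmp/orc-staging/part-0001.orc");
    Path dest = new Path("/warehouse/events/dt=1428613200/part-0001.orc");
    TypeDescription schema = TypeDescription.fromString("struct<id:bigint>");
    Writer writer = OrcFile.createWriter(scratch,
        OrcFile.writerOptions(conf).setSchema(schema));
    // ... writer.addRowBatch(...) calls go here ...
    writer.close();                   // footer is now on disk
    if (!fs.rename(scratch, dest)) {  // publish into the partition
      throw new IllegalStateException("rename failed for " + scratch);
    }
  }
}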

>


Re: External Table with unclosed orc files.

Posted by Gopal Vijayaraghavan <go...@apache.org>.
>0.14. Acid tables have been a real pain for us. We don’t believe they are
>production ready. At least in our use cases, Tez crashes for assorted
>reasons or only assigns 1 mapper to the partition. Having delta files and
>no base files borks mapper assignments.

Some of the chicken-egg problems for those were solved recently in
HIVE-10114.

Then TEZ-1993 is coming out in the next version of Tez, into which we’re
plugging in HIVE-7428 (no fix yet).

Currently delta-only splits have 0 bytes as the “file size”, so they get
grouped together to make a 16MB chunk (rather than a huge single
zero-sized split).

Those patches are the effect of me shaving the yak from the “1 mapper”
issue.

After which the writer has to follow up on HIVE-9933 to get the locality
of files fixed.

>name are left scattered about, borking queries. Latency is higher with
>streaming than writing to an orc file in hdfs, forcing obscene quantities
>of buckets and orc files smaller than any reasonable orc stripe / hdfs
>block size. The compactor hangs seemingly at random for no reason we’ve
>been able to discern.

I haven’t seen these issues yet, but I am not dealing with a large volume
insert rate, so I haven’t produced latency issues there.

Since I work on Hive performance and haven’t seen too many bugs filed,
I haven’t paid attention to the performance of ACID.

Please file bugs when you find them, so that they appear on the radar for
folks like me.

I’m poking about because I want a live stream into LLAP to work seamlessly
& return sub-second query results when queried (pre-cache/stage & merge
etc).

>An orc file without a footer is junk data (or, at least, the last stripe
>is junk data). I suppose my question should have been 'what will the hive
>query do when it encounters this? Skip the stripe / file? Error out the
>query? Something else?’

It should throw an exception, because that’s a corrupt ORC file.
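
A quick way to see that failure mode, sketched against the org.apache.orc
reader API with an invented path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class ReadUnclosed {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // createReader parses the postscript and footer from the file tail
    // first, so a never-closed file fails here with an IOException
    // (e.g. "Malformed ORC file") instead of returning partial rows.
    Reader reader = OrcFile.createReader(new Path("/data/unclosed.orc"),
        OrcFile.readerOptions(conf));
    System.out.println("rows: " + reader.getNumberOfRows());
  }
}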

The trucking demo uses Storm without ACID - this is likely to get better
once we use Apache Falcon to move the data around.

Cheers,
Gopal



Re: External Table with unclosed orc files.

Posted by "Grant Overby (groverby)" <gr...@cisco.com>.
Thanks for the link to the hive streaming bolt. We rolled our own bolt
many moons ago to utilize hive streaming. We’ve tried it against 0.13 and
0.14. Acid tables have been a real pain for us. We don’t believe they are
production ready. At least in our use cases, Tez crashes for assorted
reasons or only assigns 1 mapper to the partition. Having delta files and
no base files borks mapper assignments. Files containing flush in their
names are left scattered about, borking queries. Latency is higher with
streaming than writing to an orc file in hdfs, forcing obscene quantities
of buckets and orc files smaller than any reasonable orc stripe / hdfs
block size. The compactor hangs seemingly at random for no reason we’ve
been able to discern.



An orc file without a footer is junk data (or, at least, the last stripe
is junk data). I suppose my question should have been 'what will the hive
query do when it encounters this? Skip the stripe / file? Error out the
query? Something else?’




Grant Overby
Software Engineer
Cisco.com <http://www.cisco.com/>
groverby@cisco.com
Mobile: 865 724 4910











On 4/14/15, 4:23 PM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:

>
>> What will Hive do if querying an external table containing orc files
>>that are still being written to?
>
>Doing that directly won’t work at all, because ORC files are only
>readable after the footer has been written out, which won’t have
>happened for any open files.
>
>> I won’t be able to test these scenarios till tomorrow and would like to
>>have some idea of what to expect this afternoon.
>
>If I remember correctly, your previous question was about writing ORC
>from Storm.
>
>If you’re on a recent version of Storm, I’d advise you to look at
>storm-hive/
>
>https://github.com/apache/storm/tree/master/external/storm-hive
>
>
>Or alternatively, there’s a “hortonworks trucking demo” which does a
>partition insert instead.
>
>Cheers,
>Gopal
>
>


Re: External Table with unclosed orc files.

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> What will Hive do if querying an external table containing orc files
>that are still being written to?

Doing that directly won’t work at all, because ORC files are only
readable after the footer has been written out, which won’t have
happened for any open files.

> I won’t be able to test these scenarios till tomorrow and would like to
> have some idea of what to expect this afternoon.

If I remember correctly, your previous question was about writing ORC
from Storm.

If you’re on a recent version of Storm, I’d advise you to look at
storm-hive/

https://github.com/apache/storm/tree/master/external/storm-hive


Or alternatively, there’s a “hortonworks trucking demo” which does a
partition insert instead.

Cheers,
Gopal



RE: External Table with unclosed orc files.

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi,

 

I believe it behaves in the same way as UNIX files/partitions.

 

If the file is opened by the first process writing to it, a swap file will
be created. If the second process is only querying it, then it will see the
data as of the last save by the first process, but not the changes made
after that save.

 

It will behave much like versioning in an RDBMS.

 

HTH

 

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 


 

From: Grant Overby (groverby) [mailto:groverby@cisco.com] 
Sent: 14 April 2015 19:46
To: user@hive.apache.org
Subject: External Table with unclosed orc files.

 

What will Hive do if querying an external table containing orc files that
are still being written to?

 

If the process writing the orc files exits without calling .close()?

 

 

Sorry for taking the cheap way out and asking instead of testing. I couldn't
find anything on this via google. I won't be able to test these scenarios
till tomorrow and would like to have some idea of what to expect this
afternoon.