Posted to dev@orc.apache.org by Kyle Dunn <kd...@gmail.com> on 2016/04/22 02:20:03 UTC

Seeking to and reading stripe data

I'm trying to implement a parallel ORC reader with WebHDFS using the C++
client library. My process is as follows:
1) Download the 16 kB tail using the WebHDFS offset and length parameters
2) From the tail, determine the offsets and lengths of the stripes of interest
3) Use the stripe information from 2) as WebHDFS offset and length parameters
to download those stripes' data sections to a local file
4) Append the tail to the same local file
5) Use the ORC C++ Reader to print the contents of the local file
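
For illustration, the range requests in steps 1) and 3) above could be built
roughly as follows. This is only a sketch: the helper names (ByteRange,
tailRange, webHdfsOpenUrl) are hypothetical, and the host, port and path are
placeholders. The only part taken from WebHDFS itself is the OPEN operation
with its offset and length query parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// A byte range inside the remote ORC file (hypothetical helper type).
struct ByteRange {
  std::uint64_t offset;
  std::uint64_t length;
};

// Range covering the last `tailSize` bytes of a file that is `fileSize`
// bytes long. Files shorter than `tailSize` are returned whole.
ByteRange tailRange(std::uint64_t fileSize,
                    std::uint64_t tailSize = 16 * 1024) {
  const std::uint64_t len = std::min(fileSize, tailSize);
  return ByteRange{fileSize - len, len};
}

// Build a WebHDFS OPEN URL that asks for exactly `range` of `path`.
// `host` and `port` identify the WebHDFS endpoint (placeholders here).
std::string webHdfsOpenUrl(const std::string& host, int port,
                           const std::string& path, const ByteRange& range) {
  // WebHDFS paths are absolute, so they must start with '/'.
  assert(!path.empty() && path[0] == '/');
  return "http://" + host + ":" + std::to_string(port) + "/webhdfs/v1" +
         path + "?op=OPEN&offset=" + std::to_string(range.offset) +
         "&length=" + std::to_string(range.length);
}
```

The same webHdfsOpenUrl call would serve for a stripe by passing the stripe's
offset and length from the tail as the ByteRange.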

I'd like to clarify a couple of items:

1) The Hive configuration parameter "orc.stripe.size" seems to suggest the
stripe size is configurable, but constant for all stripes in a given file.
Can someone clarify this? Is orc.stripe.size an upper bound?

2) The Reader class in the C++ client lets me determine the byte offset and
length of a given stripe. However, if I do a partial download of an ORC file
by isolating that offset and length, and then append the tail to the stripe
data, I get Zlib logic exceptions when deserializing the data. I've also seen
an exception related to a buffer being undersized.

Is there something I'm missing? Do I need to rewrite the tail, or specify an
offset in the ORC Reader class as well?
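
One layout that might be worth trying is sketched below. It rests on an
assumption, not a confirmed diagnosis: the stripe offsets recorded in the tail
are absolute positions in the original file, so a local file that starts
directly with stripe bytes would no longer match them. The sketch writes each
downloaded piece at its original offset and zero-fills the gaps. The function
name writeAtOffset is hypothetical and uses only the C++ standard library, not
the ORC client.

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <string>

// Write `bytes` into the file at `path`, starting at absolute position
// `offset`. Any gap between the current end of file and `offset` is filled
// with zero bytes. Returns false on an I/O failure.
bool writeAtOffset(const std::string& path, std::uint64_t offset,
                   const std::string& bytes) {
  {
    // Create the file if it does not exist yet, without truncating it.
    std::ofstream create(path, std::ios::binary | std::ios::app);
    if (!create) return false;
  }
  std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
  if (!f) return false;

  // Find the current size, then pad with zeros in chunks up to `offset`.
  f.seekp(0, std::ios::end);
  std::uint64_t size = static_cast<std::uint64_t>(f.tellp());
  const std::string zeros(64 * 1024, '\0');
  while (size < offset) {
    const std::uint64_t n =
        std::min<std::uint64_t>(zeros.size(), offset - size);
    f.write(zeros.data(), static_cast<std::streamsize>(n));
    size += n;
  }

  // Place the payload at its original position in the file.
  f.seekp(static_cast<std::streamoff>(offset), std::ios::beg);
  f.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
  return static_cast<bool>(f);
}
```

With this, each stripe would be written at the offset reported for it, and the
tail at (original file size - tail length), so the local file has the same
length as the original. The padding costs local disk space but no extra
download.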


Thanks in advance for the help,
Kyle