Posted to dev@orc.apache.org by Owen O'Malley <ow...@gmail.com> on 2020/03/03 16:10:59 UTC

Re: ORC C++ write, avoid straddling HDFS blocks

On Tue, Mar 3, 2020 at 6:43 AM Aleksey Yatsenko <al...@gmail.com>
wrote:

> Hello,
>
> First of all, I would like to thank you and your colleagues for the ORC
> library, and sorry for the direct message.
>

You're welcome. I hope it is ok that I'm cc'ing the ORC dev list.

> I plan to create ORC files using the C++ API, and I found that a file
> may contain stripes that cross HDFS block boundaries. There are no
> corresponding configuration parameters in the WriterOptions class and no
> required logic in WriterImpl::add() (like that of padStripe() in
> PhysicalFsWriter.java).
> Could you please clarify whether this functionality will ever be
> implemented, or give a couple of tips on how to do it myself :).
>

I haven't heard of anyone implementing HDFS block padding on the C++ side.
It should be relatively easy for you to add, especially if you use the Java
code as an example. As the writer finishes the stripe, you can calculate
how many bytes the finished stripe will be and from there figure out the
number of padding bytes if required. Do make sure that the padding doesn't
exceed a configured threshold, to avoid corner cases where the padding is
larger than the stripe itself.
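
As a rough illustration of that calculation, here is a minimal sketch. It
assumes the writer knows its current file offset, the size of the finished
stripe, the HDFS block size, and a configured padding limit. The function and
parameter names are hypothetical and are not part of the ORC C++ API.

```cpp
#include <cstdint>

// Hypothetical helper (not an existing ORC function): returns how many zero
// bytes to write before the stripe so that it starts on the next block
// boundary, or 0 when no padding should be written.
uint64_t computeStripePadding(uint64_t currentOffset,
                              uint64_t stripeSize,
                              uint64_t blockSize,
                              uint64_t maxPadding) {
  if (blockSize == 0) {
    return 0;  // block size unknown, nothing to align to
  }
  // Bytes already occupied in the block that contains the current offset.
  uint64_t usedInBlock = currentOffset % blockSize;
  if (usedInBlock == 0) {
    return 0;  // the stripe already starts at a block boundary
  }
  uint64_t remainingInBlock = blockSize - usedInBlock;
  if (stripeSize <= remainingInBlock) {
    return 0;  // the stripe fits in the rest of the block
  }
  // The stripe would cross the boundary. Pad to the next block only if the
  // padding stays within the configured limit.
  return remainingInBlock <= maxPadding ? remainingInBlock : 0;
}
```

The writer would then emit that many zero bytes before the stripe and record
the stripe's start offset after the padding. Choosing maxPadding as a small
fraction of the stripe size is one way to keep the wasted space bounded.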

Please do contribute it back to the project.

> I also found out that libhdfspp does not support writing or
> "short circuit reads". I plan to use libhdfs3 from the Apache
> HAWQ project, which promises both of these features.
>

There is always more work to be done.

Thanks,
   Owen O'Malley

Re: ORC C++ write, avoid straddling HDFS blocks

Posted by Aleksey Yatsenko <al...@gmail.com>.
Owen,

Thank you for your answer and advice.
I will publish the patch if I manage to produce something acceptable.

Best regards,
Aleksey Yatsenko.
