Posted to dev@parquet.apache.org by Vitalii Diravka <vi...@apache.org> on 2021/04/22 11:14:51 UTC

CapacityByteArrayOutputStream vs ConcatenatingByteArrayCollector

Hi Parquet devs

Could you advise me on the right way to use *CapacityByteArrayOutputStream*?
I am going to use it in *ColumnChunkPageWriteStore* instead of
*ByteArrayOutputStream tempOutputStream* and
*ConcatenatingByteArrayCollector buf*. The purpose is that
CapacityByteArrayOutputStream can use a custom *ByteBufferAllocator
allocator* backed by direct memory (*DirectByteBufferAllocator* or a
similar one) instead of *byte[]* on the heap.
My questions: is this a reasonable approach for *ColumnChunkPageWriteStore*?
And is it fine to add an API to *ColumnChunkPageWriteStore* for using
*CapacityByteArrayOutputStream*?


Kind regards
Vitalii

Re: CapacityByteArrayOutputStream vs ConcatenatingByteArrayCollector

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Vitalii,

CapacityByteArrayOutputStream is not only about the selectable allocator
but also about the growing mechanism. Based on its documentation, you will
need a good maxCapacityHint, which I am not sure you have in the case of a
column chunk. We have size limits/hints for pages and row groups but no
such thing for column chunks. If you set the hint too high, you may
end up allocating too much space; however, it should not be worse than the
existing ByteArrayOutputStream. If you set it too low, you might end up
with too many allocations while growing, which could hurt performance.
If you can come up with a good maxCapacityHint and prove with performance
tests that the change is not slower than the original, I am fine with this
update.
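
To make the trade-off around maxCapacityHint concrete, here is a minimal
sketch of a slab-based buffer backed by direct memory. It is a simplified
model and not the parquet-mr implementation: the class name
SlabGrowthSketch, the helper slabsFor, and the growth rule (double the
capacity until more than one fifth of the hint is used, then add slabs of
one fifth of the hint) are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified model of a slab-based output buffer backed by direct memory.
 * Hypothetical illustration only; this is NOT the parquet-mr class.
 */
class SlabGrowthSketch {
    private final int initialSlabSize;
    private final int maxCapacityHint;
    private final List<ByteBuffer> slabs = new ArrayList<>();
    private ByteBuffer current;
    private long bytesUsed;

    SlabGrowthSketch(int initialSlabSize, int maxCapacityHint) {
        if (initialSlabSize <= 0 || maxCapacityHint <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.initialSlabSize = initialSlabSize;
        this.maxCapacityHint = maxCapacityHint;
    }

    /** Assumed growth rule: double until hint/5 is exceeded, then add slabs of hint/5. */
    private void addSlab() {
        long next;
        if (bytesUsed == 0) {
            next = initialSlabSize;
        } else if (bytesUsed > maxCapacityHint / 5) {
            next = maxCapacityHint / 5;
        } else {
            next = bytesUsed; // a slab as large as everything written so far
        }
        // Each slab is one direct (off-heap) allocation of at least one byte.
        current = ByteBuffer.allocateDirect((int) Math.max(next, 1));
        slabs.add(current);
    }

    void write(byte[] src, int off, int len) {
        while (len > 0) {
            if (current == null || !current.hasRemaining()) {
                addSlab();
            }
            int n = Math.min(len, current.remaining());
            current.put(src, off, n);
            off += n;
            len -= n;
            bytesUsed += n;
        }
    }

    long size() {
        return bytesUsed;
    }

    int slabCount() {
        return slabs.size();
    }

    /** Writes totalBytes in pieces of chunkSize and reports how many slabs were allocated. */
    static int slabsFor(int initialSlabSize, int maxCapacityHint, int totalBytes, int chunkSize) {
        SlabGrowthSketch out = new SlabGrowthSketch(initialSlabSize, maxCapacityHint);
        byte[] chunk = new byte[chunkSize];
        for (int written = 0; written < totalBytes; written += chunkSize) {
            out.write(chunk, 0, Math.min(chunkSize, totalBytes - written));
        }
        return out.slabCount();
    }

    public static void main(String[] args) {
        int total = 1 << 20; // 1 MiB of data, written in 1 KiB pieces
        System.out.println("hint too low : " + slabsFor(1024, 4 * 1024, total, 1024) + " slabs");
        System.out.println("hint adequate: " + slabsFor(1024, 4 << 20, total, 1024) + " slabs");
    }
}
```

Under these assumptions, a hint far below the real data size keeps every
new slab small, so the same 1 MiB needs over a hundred times as many
allocations as with a hint above the data size. A real decision would
still need benchmarks against the actual parquet-mr classes.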
About the API: ColumnChunkPageWriteStore is not part of the public API of
parquet-mr. I know it is public from the Java point of view, but it was
never meant to be used directly. Adding a new public method is neither a
pro nor a con; it is just good to know what we are extending. I think, if
the performance tests support it, it would be cleaner to simply change the
ByteArrayOutputStream to CapacityByteArrayOutputStream without any new API.

Cheers,
Gabor

