Posted to dev@arrow.apache.org by "Ambalu, Robert" <Ro...@Point72.com> on 2021/06/16 20:04:03 UTC

[C++] Parquet streaming

Apache community, I just want to confirm my understanding of Parquet files.
I have a streaming data set that may be produced in real time. Ideally I would stream it into a Parquet file (and, if the process crashes, still be able to read some part of what was streamed).
I can do this with Arrow output (the last written batch would be readable), but as far as I can tell it's not possible with the Parquet writers.
Is my understanding correct?

Thanks
- Rob





DISCLAIMER: This e-mail message and any attachments are intended solely for the use of the individual or entity to which it is addressed and may contain information that is confidential or legally privileged. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, copying or other use of this message or its attachments is strictly prohibited. If you have received this message in error, please notify the sender immediately and permanently delete this message and any attachments.




RE: [C++] Parquet streaming

Posted by "Ambalu, Robert" <Ro...@Point72.com>.
Understood, thank you for the quick response.

-----Original Message-----
From: Micah Kornfield <em...@gmail.com> 
Sent: Wednesday, June 16, 2021 4:13 PM
To: dev <de...@arrow.apache.org>
Cc: Shamis, Michael <Mi...@CubistSystematic.com>
Subject: Re: [C++] Parquet streaming

Correct, you cannot recover a partially written Parquet file.

This is only really feasible with the Arrow streaming format, and even there
you might run into issues if the data is not synced at the appropriate
place. The Arrow file format requires a footer to be written, so it has the
same issue.

-Micah





Re: [C++] Parquet streaming

Posted by Micah Kornfield <em...@gmail.com>.
Correct, you cannot recover a partially written Parquet file.

This is only really feasible with the Arrow streaming format, and even there
you might run into issues if the data is not synced at the appropriate
place. The Arrow file format requires a footer to be written, so it has the
same issue.

-Micah
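
To illustrate why a required footer prevents recovery, here is a minimal sketch under stated assumptions. It uses a toy byte layout, not the real Arrow IPC or Parquet format, and every name in it (WriteStream, ReadStream, WriteWithFooter, ReadWithFooter, the TOY1 marker) is hypothetical and exists only for this example. The stream layout prefixes each batch with its length, so a reader can stop at the first incomplete batch. The footer layout stores all lengths at the end of the file, so a file cut short before the footer is written cannot be interpreted at all.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Toy model only: NOT the real Arrow IPC or Parquet byte layout.
// A "batch" is just an opaque payload string; all names are hypothetical.

const char kMagic[4] = {'T', 'O', 'Y', '1'};  // made-up end-of-file marker

void AppendU32(std::string* out, std::size_t value) {
  assert(value <= UINT32_MAX);
  const std::uint32_t v = static_cast<std::uint32_t>(value);
  out->append(reinterpret_cast<const char*>(&v), sizeof(v));
}

std::uint32_t ReadU32(const std::string& bytes, std::size_t pos) {
  std::uint32_t v = 0;
  std::memcpy(&v, bytes.data() + pos, sizeof(v));
  return v;
}

// Stream layout: [u32 length][payload], repeated. Each batch describes
// itself, so a reader needs nothing beyond the bytes written so far.
std::string WriteStream(const std::vector<std::string>& batches) {
  std::string out;
  for (const auto& b : batches) {
    AppendU32(&out, b.size());
    out.append(b);
  }
  return out;
}

// Returns every complete batch and stops at the first truncated one.
std::vector<std::string> ReadStream(const std::string& bytes) {
  std::vector<std::string> batches;
  std::size_t pos = 0;
  while (bytes.size() - pos >= sizeof(std::uint32_t)) {
    const std::size_t len = ReadU32(bytes, pos);
    pos += sizeof(std::uint32_t);
    if (bytes.size() - pos < len) break;  // payload cut off mid-write
    batches.emplace_back(bytes, pos, len);
    pos += len;
  }
  return batches;
}

// Footer layout: [payload]... [u32 length]... [u32 count][magic].
// Batch boundaries are only known once the footer has been written.
std::string WriteWithFooter(const std::vector<std::string>& batches) {
  std::string out;
  for (const auto& b : batches) out.append(b);
  for (const auto& b : batches) AppendU32(&out, b.size());
  AppendU32(&out, batches.size());
  out.append(kMagic, sizeof(kMagic));
  return out;
}

// Returns false when the footer is missing or inconsistent.
bool ReadWithFooter(const std::string& bytes,
                    std::vector<std::string>* batches) {
  const std::size_t tail = sizeof(std::uint32_t) + sizeof(kMagic);
  if (bytes.size() < tail) return false;
  if (std::memcmp(bytes.data() + bytes.size() - sizeof(kMagic), kMagic,
                  sizeof(kMagic)) != 0) {
    return false;  // no end marker: the writer never finished
  }
  const std::size_t count = ReadU32(bytes, bytes.size() - tail);
  const std::size_t footer = tail + count * sizeof(std::uint32_t);
  if (bytes.size() < footer) return false;
  const std::size_t data_end = bytes.size() - footer;
  std::size_t len_pos = data_end;
  std::size_t data_pos = 0;
  batches->clear();
  for (std::size_t i = 0; i < count; ++i, len_pos += sizeof(std::uint32_t)) {
    const std::size_t len = ReadU32(bytes, len_pos);
    if (len > data_end - data_pos) return false;  // lengths overrun payloads
    batches->emplace_back(bytes, data_pos, len);
    data_pos += len;
  }
  return true;
}
```

As a usage example, writing three batches with WriteStream and dropping the last few bytes of the result still lets ReadStream return the first two batches, while the same truncation applied to the output of WriteWithFooter makes ReadWithFooter return false. In this toy the "crash" is simulated by truncating an in-memory string; with real files, whether the leading bytes actually reached disk is a separate question, which is the syncing caveat mentioned above.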
