Posted to user@arrow.apache.org by Xander Dunn <xa...@xander.ai> on 2021/05/26 16:32:00 UTC

Long-Running Continuous Data Saving to File

I have a very long-running (months) program that is streaming in data
continually, processing it, and saving it to file using Arrow. My current
solution is to buffer several million rows and write them to a new .parquet
file each time. This works, but produces 1000+ files every day.
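
For concreteness, here is a minimal sketch of that buffer-and-flush pattern using the
Arrow C++ API (the two-column schema, the Row struct, and the output path are
hypothetical placeholders, not details from the original setup):

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

#include <memory>
#include <string>
#include <vector>

struct Row { int64_t ts; double value; };  // hypothetical record type

// Flush one buffered batch to a brand-new Parquet file; nothing is appended.
arrow::Status FlushBatch(const std::vector<Row>& rows, const std::string& path) {
  arrow::Int64Builder ts_builder;
  arrow::DoubleBuilder value_builder;
  for (const Row& r : rows) {
    ARROW_RETURN_NOT_OK(ts_builder.Append(r.ts));
    ARROW_RETURN_NOT_OK(value_builder.Append(r.value));
  }
  std::shared_ptr<arrow::Array> ts_array;
  std::shared_ptr<arrow::Array> value_array;
  ARROW_RETURN_NOT_OK(ts_builder.Finish(&ts_array));
  ARROW_RETURN_NOT_OK(value_builder.Finish(&value_array));

  auto schema = arrow::schema({arrow::field("ts", arrow::int64()),
                               arrow::field("value", arrow::float64())});
  auto table = arrow::Table::Make(schema, {ts_array, value_array});

  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1 << 20));
  return sink->Close();
}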

If I could, I would just append to the same file for each day. I see an
`arrow::fs::FileSystem::OpenAppendStream` - what file formats does this
work with? Can I append to .parquet or .feather files? Googling seems to
indicate these formats can't be appended to.
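
For reference, a minimal sketch of what OpenAppendStream gives you: a raw,
byte-oriented append stream on the underlying filesystem (the local filesystem and
the line-delimited file name below are assumptions for illustration). Appending raw
bytes does not by itself make a footer-based format like Parquet appendable - which
is what the replies below get into.

#include <arrow/filesystem/localfs.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <memory>
#include <string>

// Append one line of text to a (hypothetical) line-delimited log file.
arrow::Status AppendLine(const std::string& line) {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::io::OutputStream> out,
                        fs->OpenAppendStream("events.jsonl"));
  ARROW_RETURN_NOT_OK(out->Write(line.data(), static_cast<int64_t>(line.size())));
  ARROW_RETURN_NOT_OK(out->Write("\n", 1));
  return out->Close();
}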

Using the `parquet::StreamWriter
<https://arrow.apache.org/docs/cpp/parquet.html?highlight=writetable#writetable>`,
could I continually stream rows to a single file throughout the day? What
happens if the program is unexpectedly terminated? Would everything in the
currently open monolithic file be lost? I would be streaming rows to a
single .parquet file for 24 hours.
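
For context, a minimal sketch of that StreamWriter usage, following the linked docs
(the two-column schema and the file name are hypothetical). The comments flag where
the footer gets written, since that is the crux of the crash question:

#include <arrow/io/file.h>
#include <parquet/exception.h>
#include <parquet/schema.h>
#include <parquet/stream_writer.h>

#include <memory>

int main() {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("day.parquet"));

  // Two-column Parquet schema: (ts: int64, value: double).
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "ts", parquet::Repetition::REQUIRED, parquet::Type::INT64,
      parquet::ConvertedType::INT_64));
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "value", parquet::Repetition::REQUIRED, parquet::Type::DOUBLE,
      parquet::ConvertedType::NONE));
  auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  parquet::WriterProperties::Builder props;
  parquet::StreamWriter os{
      parquet::ParquetFileWriter::Open(outfile, schema, props.build())};

  // Rows can be streamed one at a time, all day long if desired.
  for (int64_t i = 0; i < 1000; ++i) {
    os << i << static_cast<double>(i) * 0.5 << parquet::EndRow;
  }

  // The footer is only written when the writer is closed (here, when `os` is
  // destroyed at the end of main). Rows streamed before an unexpected crash
  // would sit in a file with no footer, which standard readers cannot open.
  return 0;
}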

Thanks,
Xander

Re: Long-Running Continuous Data Saving to File

Posted by Elad Rosenheim <el...@dynamicyield.com>.
I want to add a few notes from my experience with Kafka:

1. There's an ecosystem - having battle-tested consumers that write to
various external systems, with known reliability guarantees, is very
helpful. It's also then possible to have multiple consumers - some batch,
some real-time streaming (e.g. Apache Flink) or analytics (ksqlDB
<https://ksqldb.io/>). People have already given thought to schema
evolution and whatnot, as Weston noted.

2. In terms of operations - yup, it wasn't as easy as I'd hoped (mostly
when servers crash and stuff like that). We also have a component that runs
on a VM with Kafka running locally on that machine, used to buffer
downstream writes. That's also a possible setup - you don't *have* to have
a cluster. In this mode the buffer's durability is tied to the machine
being live and the size of the local disk, but it can also "just work" for
years.

By the way, Parquet not supporting append is in line with the bigger picture: HDFS
and object stores (e.g. S3) don't really support appending to files (I
think HDFS now does?), and many "big data" databases are based on LSM trees
<https://en.wikipedia.org/wiki/Log-structured_merge-tree>, which generally
don't require appending to existing files. So the whole ecosystem assumes
compaction (and then cleanup) / re-partitioning rather than appending
to already-written files.
Elad

Re: Long-Running Continuous Data Saving to File

Posted by Xander Dunn <xa...@xander.ai>.
Thanks to both of you, this is helpful.


Re: Long-Running Continuous Data Saving to File

Posted by Weston Pace <we...@ursacomputing.com>.
Elad's advice is very helpful.  This is not a problem that Arrow solves
today (to the best of my knowledge).  It is a topic that comes up
periodically[1][2][3].  If a crash happens while your parquet stream writer
is open then the most likely outcome is that you will be missing the footer
(this gets written on close) and be unable to read the file (although it
could presumably be recovered).  The parquet format may be able to support
an append mode but readers don't typically support it.

I believe a common approach to this problem is to dump out lots of small
files as the data arrives and then periodically batch them together.  Kafka
is a great way to do this but it could be done with a single process as
well.  If you go very far down this path you will likely run into concerns
like durability and schema evolution so I don't mean to imply that it is
trivial :)

[1]
https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file
[2] https://issues.apache.org/jira/browse/PARQUET-1154
[3]
https://lists.apache.org/thread.html/r7efad314abec0219016886eaddc7ba79a451087b6324531bdeede1af%40%3Cdev.arrow.apache.org%3E
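
A sketch of the "periodically batch small files together" step described above,
assuming the small files share a single schema and using hypothetical paths:

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <memory>
#include <string>
#include <vector>

// Read a set of small Parquet files, concatenate them, and write one bigger file.
arrow::Status Consolidate(const std::vector<std::string>& small_files,
                          const std::string& out_path) {
  std::vector<std::shared_ptr<arrow::Table>> tables;
  for (const auto& path : small_files) {
    ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_RETURN_NOT_OK(
        parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
    tables.push_back(std::move(table));
  }
  ARROW_ASSIGN_OR_RAISE(auto combined, arrow::ConcatenateTables(tables));

  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(out_path));
  // chunk_size controls how many rows go into each row group of the new file.
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *combined, arrow::default_memory_pool(), sink, /*chunk_size=*/1 << 20));
  return sink->Close();
}

Note that ConcatenateTables expects the tables to share a schema, which is where
the schema-evolution concern mentioned above starts to bite.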

Re: Long-Running Continuous Data Saving to File

Posted by Elad Rosenheim <el...@dynamicyield.com>.
Hi,

While I'm not using the C++ version of Arrow, the issue you're talking
about is a very common concern.

There are a few points to discuss here:

1. Generally, Parquet files cannot be appended to. You could of course load
the file to memory, add more information and re-save, but that's not really
what you're looking for... tools like `parquet-tools` can concatenate files
together by creating a new file with two (or more) row groups, but that's
not a very good solution either. Having multiple row groups in a single
file is sometimes desirable, but in this case it would most probably just
produce a less well-compressed file.

2. The other concern is reliability - having a process that holds a big
batch in memory and then spills it to disk every X minutes/rows/bytes is
bound to have issues when things crash/get stuck/need to go down for
maintenance. You probably want to have as close to "exactly once"
guarantees as possible (the holy grail...). One common solution for this is
to write to Kafka, and have a consumer that periodically reads a batch of
messages and stores them to file. This is nowadays provided by Kafka Connect
<https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/>,
thankfully. Anyway, the "exactly once" part stops at this point; for
anything that happens downstream you'd need to handle that yourself.

3. Then, you're back to the question of many, many files per day... there is
no magical solution to this. You may need a scheduled task that reads files
every X hours (or every day?) and re-partitions the data in the way that
makes the most sense for processing/querying later - perhaps by date,
perhaps by customer, or both. There are various tools that help with this.
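
A sketch of such a scheduled re-partitioning step using the Arrow C++ dataset API
(assuming a reasonably recent Arrow release, a local filesystem, and a hypothetical
string "date" column used as the partition key):

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/localfs.h>

#include <memory>
#include <string>

namespace ds = arrow::dataset;

// Write `table` back out as a directory tree partitioned Hive-style by "date",
// e.g. base_dir/date=2021-05-26/part-0.parquet.
arrow::Status Repartition(std::shared_ptr<arrow::Table> table,
                          const std::string& base_dir) {
  auto dataset = std::make_shared<ds::InMemoryDataset>(std::move(table));
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  auto format = std::make_shared<ds::ParquetFileFormat>();

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
  write_options.base_dir = base_dir;
  write_options.partitioning = std::make_shared<ds::HivePartitioning>(
      arrow::schema({arrow::field("date", arrow::utf8())}));
  write_options.basename_template = "part-{i}.parquet";

  return ds::FileSystemDataset::Write(write_options, scanner);
}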

Elad
