Posted to user@arrow.apache.org by Sam Davis <Sa...@nanoporetech.com> on 2021/07/14 10:39:44 UTC

[Python/C++] Streaming Format to IPC File Format Conversion

Hi,

I'm interested in a use case where there is a long-running job producing results as it goes that may die and therefore must be restarted, making sure to continue from the last known-good point.

For this use case, it seems best to use the "IPC Streaming Format" and write out the batches as they are generated.

However, once the job is finished it would also be beneficial to have random access into the file. It seems like this is possible by the steps below (an alternative one-pass rewrite is sketched after the list):

  1.  Manually creating a file with the correct magic number/padding bytes and then seeking past them.
  2.  Writing batches out as they appear.
  3.  Doing a pass over the record batches to gather the information required to generate the footer data.

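A minimal sketch of that alternative, assuming pyarrow, hypothetical file paths, and no dictionary-encoded columns: once the stream is complete, a single rewriting pass produces a random-access IPC file directly.

    import pyarrow as pa
    import pyarrow.ipc as ipc

    def stream_to_file(stream_path, file_path):
        # One pass over a completed IPC stream, rewriting it in the IPC
        # file format so that the footer (and with it random access)
        # exists.
        with pa.OSFile(stream_path, "rb") as source:
            reader = ipc.open_stream(source)
            with pa.OSFile(file_path, "wb") as sink:
                writer = ipc.new_file(sink, reader.schema)
                for batch in reader:
                    writer.write_batch(batch)
                writer.close()

    stream_to_file("results.arrows", "results.arrow")

This avoids footer surgery entirely, at the cost of rewriting the data once.
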
Whilst this seems possible, it doesn't seem to be a use case that has come up before, which surprises me: adding index information to a "completed" file seems like a genuinely useful thing to want to do.

Has anyone encountered something similar before?

Is there an easier way to achieve this? I.e., does this functionality, or parts of it, exist in another language that I can bind to from Python?

Best,

Sam


IMPORTANT NOTICE: The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, re-transmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer. Although we routinely screen for viruses, addressees should check this e-mail and any attachment for viruses. We make no warranty as to absence of viruses in this e-mail or any attachments.

Re: [Python/C++] Streaming Format to IPC File Format Conversion

Posted by Micah Kornfield <em...@gmail.com>.
>
> but I see the
> value in providing a way to do random access into a stream-file after
> writing it without having to rewrite the file into the file format


Another path forward seems to be what Sam initially called out as a
workflow, i.e. create a new API that takes a partially written IPC File
formatted file and allows for "finishing" it.  I think the complicated
part is likely determining a resumption point (which is maybe an API
input, so people can determine their own system for doing this
transactionally).

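As a sketch of that resumption-point piece (assuming pyarrow; exact end-of-stream behavior varies across versions, so any failed read is simply treated as the truncation point):

    import pyarrow as pa
    import pyarrow.ipc as ipc

    def last_good_offset(stream_path):
        # Walk the messages of a possibly truncated IPC stream and return
        # the offset just past the last complete one. A restarted job can
        # truncate the file to this point and resume appending.
        good = 0
        with pa.OSFile(stream_path, "rb") as f:
            try:
                while True:
                    if ipc.read_message(f) is None:  # defensive: end of stream
                        break
                    good = f.tell()
            except Exception:  # truncated or malformed tail
                pass
        return good

A restarted job could truncate to this offset and append further encapsulated record-batch messages (e.g. via RecordBatch.serialize()) without writing a second schema header.
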
-Micah

[1] https://github.com/apache/arrow/pull/4815

Re: [Python/C++] Streaming Format to IPC File Format Conversion

Posted by Wes McKinney <we...@gmail.com>.
hi Micah — makes sense. I agree that starting down the path of "table
management" in Arrow is probably too much scope creep since the
requirements (e.g. schema evolution) can vary so much, but I see the
value in providing a way to do random access into a stream-file after
writing it without having to rewrite the file into the file format
(which may be tricky given possible issues with dictionary deltas).

Re: [Python/C++] Streaming Format to IPC File Format Conversion

Posted by Micah Kornfield <em...@gmail.com>.
I think if we tried to tack this on, it might be worth going through the
design effort to see if something is possible without external files.  The
stream format also allows more flexibility around dictionaries than the
file format does, so there is a possibility of impedance mismatch.

Before we go with our own specification for external metadata, it seems
that looking at integration with something like Iceberg would make sense.

My understanding is that external metadata files are on the path to
deprecation, or at least are not recommended, in Parquet [1].

[1]
https://lists.apache.org/thread.html/r9897237ce76287e66109994320d876d32e11db6acc32490b99a41842%40%3Cdev.parquet.apache.org%3E

Re: [Python/C++] Streaming Format to IPC File Format Conversion

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Jul 14, 2021 at 5:40 PM Aldrin <ak...@ucsc.edu> wrote:
>
> Forgive me if I am misunderstanding the context, but my initial impression would be that this is solved at a higher layer than the file format. While some approaches make sense at
> the file format level, that layer may not be the best place for this. I suspect that book-keeping for this type of conversion would be affected by batching granularity (can you group multiple
> streamed batches?) and what type of process/job it is (is the job at the level of, say, a bash script, or of a copy task?).
>
> Some questions and thoughts below:
>
>
>> One thing that occurs to me is whether we could enable the file
>> footer metadata to live in a "sidecar" file to support this use case.
>
>
> This sounds like a good, simple approach that could serve as a default. But I feel like this is essentially the same as maintaining an independent metadata file that could be described
> in a cookbook or something. Seems odd to me, personally, to include it in the format definition.

The problem with this is that it is not compliant with our
specification ([1]), so applications would not be able to hope for any
interoperability. Parquet provides for file footer metadata living
separate from the row groups (akin to our record batches), and this is
formalized in the format ([2]). None of the Arrow projects have any
mechanism to deal with the Footer independently — to do something with
that metadata that is not in the project specification is not
something we could support and provide backward/forward
compatibility for.

[1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
[2]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L787

Re: [Python/C++] Streaming Format to IPC File Format Conversion

Posted by Aldrin <ak...@ucsc.edu>.
Forgive me if I am misunderstanding the context, but my initial impression
would be that this is solved at a higher layer than the file format. While
some approaches make sense at the file format level, that layer may not be
the best place for this. I suspect that book-keeping for this type of
conversion would be affected by batching granularity (can you group
multiple streamed batches?) and what type of process/job it is (is the job
at the level of, say, a bash script, or of a copy task?).

Some questions and thoughts below:


One thing that occurs to me is whether we could enable the file
> footer metadata to live in a "sidecar" file to support this use case.
>

This sounds like a good, simple approach that could serve as a default. But
I feel like this is essentially the same as maintaining an independent
metadata file that could be described
in a cookbook or something. Seems odd to me, personally, to include it in
the format definition.


3. Doing a pass over the record batches to gather the information required
> to generate the footer data.
>

Could you maintain footer data incrementally and always write to the same
spot whenever some number of batches are written to the destination?

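One way to picture that incremental bookkeeping, as a sketch with a hypothetical JSON checkpoint file (nothing Arrow-specific): rewrite the checkpoint atomically after every batch, so a crash leaves either the old copy or the new one, never a torn one.

    import json
    import os

    def write_checkpoint(path, batch_count, stream_offset):
        # Rewrite the checkpoint "in the same spot" atomically: write a
        # temp file, fsync it, then rename it over the old copy (atomic
        # on POSIX filesystems).
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"batches": batch_count, "offset": stream_offset}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)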

2. Writing batches out as they appear.
>

Might batches be received out of order? Is this long-running job streaming
over a network connection? Might the source be distributed/striped over
multiple sources/locations?


a use case where there is a long-running job producing results as it goes
> that may die and therefore must be restarted
>

Would the long running job only be handling independent streams,
concurrently? e.g. is it an asynchronous job that handles a single logical
stream, or does it manage a pool of streams
for concurrent requesting processes?

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

Re: [Python/C++] Streaming Format to IPC File Format Conversion

Posted by Wes McKinney <we...@gmail.com>.
hi Sam — it's an interesting proposition. Other file formats like
Parquet don't make "resuming" particularly easy, either. The magic
number at the beginning of an Arrow file means that it's a lot more
expensive to turn a stream file into an Arrow file-format file — if we'd
thought about this use case, we might have chosen to only put the
magic number at the end of the file.

It's also not possible to put the file metadata "outside" the stream
file. One thing that occurs to me is whether we could enable the file
footer metadata to live in a "sidecar" file to support this use case.
To enable this, we would have to add a new optional field to Footer in
File.fbs that indicates the file path that the Footer references. This
would be null when the footer is part of the same file where the data
lives. A function could be implemented to produce this "sidecar index"
file from a stream file.

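Until something like that exists in the spec, an application-level stand-in can be sketched with pyarrow and a hypothetical JSON sidecar (assuming no dictionary-encoded columns): record the byte offset of each record-batch message in one scan, then seek and decode batches on demand.

    import json
    import pyarrow as pa
    import pyarrow.ipc as ipc

    def build_sidecar_index(stream_path, index_path):
        # One scan over a completed IPC stream, recording where each
        # record-batch message starts.
        offsets = []
        with pa.OSFile(stream_path, "rb") as f:
            ipc.read_message(f)  # skip the schema message
            while True:
                pos = f.tell()
                try:
                    msg = ipc.read_message(f)
                except Exception:  # end-of-stream marker or EOF
                    break
                if msg is None:
                    break
                if msg.type == "record batch":
                    offsets.append(pos)
        with open(index_path, "w") as out:
            json.dump(offsets, out)

    def read_batch(stream_path, index_path, i, schema):
        # Random access: seek to the i-th recorded offset and decode only
        # that batch.
        with open(index_path) as f:
            offsets = json.load(f)
        with pa.OSFile(stream_path, "rb") as f:
            f.seek(offsets[i])
            return ipc.read_record_batch(ipc.read_message(f), schema)

The schema needed for decoding can be recovered from the stream itself (e.g. via ipc.open_stream on the same file) before seeking.
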
Not sure on others' thoughts about this.

Thanks,
Wes

