Posted to user@flink.apache.org by Ryan Skraba via user <us...@flink.apache.org> on 2023/05/26 16:38:59 UTC

Bulk storage of protobuf records in files

Hello all!

I discovered while investigating FLINK-32008[1] that we can write to the
filesystem connector with the protobuf format, but today, the resulting
file is pretty unlikely to be useful or rereadable.

There's no real standard for storing many protobuf messages in a single
file container, although the documentation mentions writing size-delimited
messages sequentially[2].  In practice, I've never encountered protobuf
binaries stored on filesystems without some other sort of "framing"
(like how Parquet can be accessed with either an Avro-oriented or a
protobuf-oriented API).
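
For illustration, here is a minimal sketch of the size-delimited framing
mentioned in [2]: each record is written as a base-128 varint length
followed by the raw message bytes.  As far as I know this is the same
layout that protobuf's Java writeDelimitedTo / parseDelimitedFrom use, but
the helper names below are hypothetical and the payloads are treated as
opaque bytes (no protobuf library involved), so take it as a sketch rather
than what the Flink format does.

```python
import io
from typing import BinaryIO, Iterator, Optional


def write_varint(out: BinaryIO, value: int) -> None:
    """Write a non-negative integer as a base-128 varint (low bits first)."""
    if value < 0:
        raise ValueError("length must be non-negative")
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([byte | 0x80]))  # continuation bit set
        else:
            out.write(bytes([byte]))
            return


def read_varint(src: BinaryIO) -> Optional[int]:
    """Read one varint; return None if the stream ends before it starts."""
    result = 0
    shift = 0
    while True:
        chunk = src.read(1)
        if not chunk:
            if shift == 0:
                return None  # clean end of stream
            raise EOFError("stream ended inside a varint")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def write_delimited(out: BinaryIO, payload: bytes) -> None:
    """Hypothetical helper: length prefix, then the serialized message."""
    write_varint(out, len(payload))
    out.write(payload)


def read_delimited(src: BinaryIO) -> Iterator[bytes]:
    """Hypothetical helper: yield each framed payload until end of stream."""
    while True:
        size = read_varint(src)
        if size is None:
            return
        payload = src.read(size)
        if len(payload) != size:
            raise EOFError("stream ended inside a record")
        yield payload


# Stand-ins for already-serialized protobuf messages (opaque bytes here).
records = [b"\x08\x01", b"", b"x" * 300]

buffer = io.BytesIO()
for record in records:
    write_delimited(buffer, record)

buffer.seek(0)
roundtrip = list(read_delimited(buffer))
```

Note that a file framed this way can only be read from the beginning: there
is no sync marker to locate a record boundary from an arbitrary offset, so
it is not splittable on its own.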

Does anyone have any use cases for bulk storage of protobuf messages on a
filesystem?  Should these files just be considered temporary storage for
Flink jobs, or do they need to be compatible with other systems?  Is there
a splittable / compressible file format?

The alternative might be to just forbid file storage for protobuf
messages!  Any opinions?

All my best, Ryan Skraba

[1]: https://issues.apache.org/jira/browse/FLINK-32008
[2]: https://protobuf.dev/programming-guides/techniques/#streaming

Re: Bulk storage of protobuf records in files

Posted by Shammon FY <zj...@gmail.com>.
Hi Ryan,

What I usually see is Protobuf-format data being written to systems such as
Kafka; I have not yet come across it being written to files.

Best,
Shammon FY


On Mon, Jun 5, 2023 at 10:50 PM Martijn Visser <ma...@apache.org>
wrote:

> Hey Ryan,
>
> I've never encountered a use case for writing Protobuf encoded files to a
> filesystem.
>
> Best regards,
>
> Martijn

Re: Bulk storage of protobuf records in files

Posted by Martijn Visser <ma...@apache.org>.
Hey Ryan,

I've never encountered a use case for writing Protobuf encoded files to a
filesystem.

Best regards,

Martijn
