Posted to dev@avro.apache.org by Gokay Tosunoglu <go...@yahoo.com.INVALID> on 2022/05/12 06:07:00 UTC

Avro Big Data Question from a developer

Hi there,

I have a C# application which deals with big data. For example, a single file can be more than 100 GB. This big file is in CSV format; I read the customer data in buffers and check for the end of line before reading the next buffer.

I want to support Avro too, but I couldn't find out how to read a 100 GB Avro file and/or how to divide the file into buffers. Can any of you send me a way to do it, or maybe some sample code?

Thanks in advance.
Gokay Tosunoglu

Re: Avro Big Data Question from a developer

Posted by Martin Grigorov <mg...@apache.org>.
Hi Gokay,

I am not sure whether you received Fokko's response since you are/were not
subscribed to the mailing list (I know because I moderated your first
email).
Please check
https://lists.apache.org/thread/58c9v7qbof3jzgxzx6qf9h436zcp79wp

On Thu, May 12, 2022 at 11:44 AM Gokay Tosunoglu
<go...@yahoo.com.invalid> wrote:

> Hi there,
> I am new to Avro. I have a C# application which deals with big data. For
> example, a single file can be more than 100 GB. This big file is in CSV
> format; to be able to read the data, I am using BufferedStream and checking
> for the end of line before reading the next buffer. I want to support the
> Avro file type too, but I couldn't find out how to read a 100 GB Avro file
> and/or how to divide the file into buffers. Can any of you send me a way to
> do it, or maybe some sample code?
> Thanks in advance.
> Gokay Tosunoglu

Avro Big Data Question from a developer

Posted by Gokay Tosunoglu <go...@yahoo.com.INVALID>.
Hi there,

I am new to Avro. I have a C# application which deals with big data. For example, a single file can be more than 100 GB. This big file is in CSV format; to be able to read the data, I am using BufferedStream and checking for the end of line before reading the next buffer.

I want to support the Avro file type too, but I couldn't find out how to read a 100 GB Avro file and/or how to divide the file into buffers. Can any of you send me a way to do it, or maybe some sample code?

Thanks in advance.
Gokay Tosunoglu

Re: Avro Big Data Question from a developer

Posted by "Driesprong, Fokko" <fo...@apache.org>.
Hi Gokay,

That's some CSV file. That will probably be much smaller in Avro.

An Avro file is a so-called Object Container File. This was implemented in
the MapReduce era to make sure that the workload for each of the workers is
roughly the same, which makes it easier to tune the memory requirements. An
Avro file consists of a header followed by one or more data blocks, which
are individually compressed. The blocks are separated by a synchronization
marker. More info can be found here:
https://avro.apache.org/docs/current/spec.html

Also, the C# code gives some good pointers:
https://github.com/apache/avro/blob/42edbd721fedc0ed6cde89ab3b64a9ac606aa74f/lang/csharp/src/apache/main/File/DataFileWriter.cs

You could read (and decompress) the blocks one by one to keep memory usage
constant, or use the block boundaries to parallelize reading the file, which
might significantly improve the throughput of the application.
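
As a rough illustration of the block-by-block approach, here is a minimal
sketch that walks an Object Container File one block at a time. It is written
in Python with the standard library only, purely to show the file layout from
the spec; it is not the API of the C# library. The function names (read_long,
read_header, iter_blocks) are made up for this example, and it assumes the
file uses the "null" or "deflate" codec.

```python
import zlib

MAGIC = b"Obj\x01"   # first four bytes of every Avro object container file
SYNC_SIZE = 16       # length of the synchronization marker in bytes


def read_long(stream, eof_ok=False):
    """Decode one Avro long (zigzag, variable-length). Returns None on a
    clean end of file when eof_ok is set."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if eof_ok and shift == 0:
                return None
            raise EOFError("unexpected end of stream inside a long")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string (also used for Avro strings)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of stream inside a byte string")
    return data


def read_header(stream):
    """Return (metadata dict, sync marker) from the file header."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:           # a zero count terminates the metadata map
            break
        if count < 0:            # negative count: a byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta, stream.read(SYNC_SIZE)


def iter_blocks(stream):
    """Yield (record_count, decompressed_payload) for each data block.
    Only one block is held in memory at a time."""
    meta, sync = read_header(stream)
    codec = meta.get("avro.codec", b"null")
    while True:
        count = read_long(stream, eof_ok=True)
        if count is None:        # no more blocks
            return
        payload = read_bytes(stream)   # block size, then that many bytes
        if stream.read(SYNC_SIZE) != sync:
            raise ValueError("sync marker mismatch; file may be corrupt")
        if codec == b"deflate":
            payload = zlib.decompress(payload, -15)   # raw deflate stream
        elif codec != b"null":
            raise NotImplementedError("codec not handled in this sketch")
        yield count, payload
```

Calling iter_blocks on a file opened in binary mode yields one block per
iteration. Each payload holds the stated number of records, encoded back to
back according to the writer schema stored under the "avro.schema" metadata
key; turning those bytes into records is the job of a datum reader, which
this sketch leaves out. For parallel reading, a worker can seek to an
arbitrary offset and scan forward for the sync marker to find the next block
boundary. In a C# application it is probably simpler to use the
DataFileReader class that sits next to the DataFileWriter linked above than
to parse the format by hand.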

Kind regards,
Fokko Driesprong



On Thu, May 12, 2022 at 8:16 AM Gokay Tosunoglu
<go...@yahoo.com.invalid> wrote:

> Hi there,
> I have a C# application which deals with big data. For example, a single
> file can be more than 100 GB. This big file is in CSV format; I read the
> customer data in buffers and check for the end of line before reading the
> next buffer. I want to support Avro too, but I couldn't find out how to
> read a 100 GB Avro file and/or how to divide the file into buffers. Can any
> of you send me a way to do it, or maybe some sample code?
> Thanks in advance.
> Gokay Tosunoglu