Posted to dev@arrow.apache.org by "Traverse, Martin" <ma...@accenture.com.INVALID> on 2023/03/28 20:21:27 UTC

RE: [External] Re: row counts in footer of IPC file format

Hello,

I could take a shot at the Java one if you like?

I'm actually working in the codebase at the moment on something related that I was going to offer as a PR once it's ready. We use the Java Arrow library as the core of our data service: the VectorSchemaRoot (VSR) is our intermediate representation, and we translate to/from various formats and across various storage backends. We really need non-blocking data reads to make that efficient and scalable, so I've made alternate implementations of the readers where you feed in data as a series of ByteBuffer objects instead of calling loadNextBatch(). For streams, this means feeding in bytes and buffering until a batch is available; for files, we read the block info from the footer and then feed in buffers (slices) for each block. I was able to reuse all the same serialization helpers etc.
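
To make the "feed in bytes and buffer until a batch is available" idea concrete, here is a minimal sketch of the pattern. It is not the implementation described above and not part of the Arrow Java API. The class name PushFrameDecoder and its feed()/poll() methods are hypothetical, and the framing is deliberately simplified to a 4-byte little-endian length prefix followed by the payload, which is an assumption made for illustration rather than Arrow's actual IPC message encapsulation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Hypothetical push-style decoder (illustration only, not an Arrow class).
 * Assumed framing: 4-byte little-endian length, then that many payload bytes.
 */
public class PushFrameDecoder {

    // Bytes received so far that do not yet form a complete frame (kept in read mode).
    private ByteBuffer pending = ByteBuffer.allocate(0).order(ByteOrder.LITTLE_ENDIAN);

    // Complete frames waiting to be consumed by the caller.
    private final Queue<ByteBuffer> ready = new ArrayDeque<>();

    /**
     * Feed the next chunk of bytes. A chunk may contain part of a frame,
     * exactly one frame, or several frames. The chunk is fully consumed.
     */
    public void feed(ByteBuffer chunk) {
        // Append the new chunk to whatever was left over from earlier calls.
        ByteBuffer merged = ByteBuffer
                .allocate(pending.remaining() + chunk.remaining())
                .order(ByteOrder.LITTLE_ENDIAN);
        merged.put(pending).put(chunk).flip();
        pending = merged;

        // Extract every complete frame that is now available.
        while (pending.remaining() >= 4) {
            int length = pending.getInt(pending.position()); // peek, do not consume
            if (length < 0) {
                throw new IllegalStateException("Negative frame length: " + length);
            }
            if (pending.remaining() - 4 < length) {
                break; // payload not fully received yet, wait for more data
            }
            pending.position(pending.position() + 4);        // skip the length prefix
            ByteBuffer frame = pending.slice();              // zero-copy view of the payload
            frame.limit(length);
            ready.add(frame.asReadOnlyBuffer());
            pending.position(pending.position() + length);   // move past the payload
        }
    }

    /** Returns the next complete frame, or null if none is available yet. */
    public ByteBuffer poll() {
        return ready.poll();
    }
}
```

The caller never blocks: it calls feed() whenever the I/O layer delivers bytes and polls for completed frames afterwards. In a real reader, each completed frame would be handed to the existing deserialization helpers rather than returned as raw bytes.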

Does this sound useful? If it does, I can raise a PR for Arrow when it's done. No worries if not; we can just keep the non-blocking readers in our own codebase. They're not a lot of code either way.

Happy to take a shot at the row counts after that, weekend time probably. If I sketched out a draft PR would you be happy to take a look and tell me if I'm on the right lines?

Kind regards,

Martin Traverse
Technical Architect
UKI Risk
Tel: +44 7305 120 791
Email: martin.traverse@accenture.com

My regular office hours are 10:00 - 18:30 UK time, Monday - Thursday

-----Original Message-----
From: Weston Pace <we...@gmail.com>
Sent: 28 March 2023 17:35
To: dev@arrow.apache.org
Subject: [External] Re: row counts in footer of IPC file format

I suspect the next step will be to create two implementations and create test files for the integration test suite.  These will be required before we can vote on this.

Are either of you interested in contributing an implementation (C++, Rust, Java, and Go have been the usual suspects in the past but JS or C# should be viable too)?  In the past, once an implementation & test files have been created for one language, it has been easier to drum up a volunteer to create a second implementation.


Re: [External] Re: row counts in footer of IPC file format

Posted by David Dali Susanibar Arce <da...@gmail.com>.
Hi Team,

Hi Martin, it would be useful to check whether this new Java functionality
is already implemented in other languages such as C++, both to decide
whether to treat it as a must-have and to see how it aligns with your
current implementation. Either way, I'm really interested in reviewing the PR.

On the row counts, I'm also interested in reviewing the PR.


Best regards


David