Posted to user@arrow.apache.org by William Smith - Network <sm...@network.lilly.com> on 2020/09/17 17:45:31 UTC

Add columns to persisted data sets -- have methods been developed?

Hello,

Please forgive what I'm sure is a frequent question, but I have not been able to find a good solution to what must be a very common pattern: a pipeline element retrieves a large data set, performs an expensive computation to create one or more new columns, and wants to save the expanded data set for downstream pipeline elements, which will consume some of the new data and some of the old.
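
For concreteness, a minimal sketch of that pattern with pyarrow (the Parquet format, file paths, column names, and the stand-in computation are all assumptions on my part, not our actual pipeline):

    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Read the persisted data set (path is made up for illustration).
    table = pq.read_table("measurements.parquet")

    # Stand-in for the expensive computation that produces a new column.
    new_col = pc.multiply(table["raw_signal"], 2)

    # Append the new column and write the expanded data set for downstream use.
    expanded = table.append_column("normalized_signal", new_col)
    pq.write_table(expanded, "measurements_expanded.parquet")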

As I understand it, there is no way to alter a persisted data set.  Is that correct?  If so, how do others address this situation?  The obvious answer is to write a new data set, but that approach wastes space and duplicates data.  One could instead write only the new columns to a second data set (as in the sketch below), but then we have to manage the links between data sets ourselves.  Is that link managed by Arrow?  If not, are there standard extensions for managing it, or is there a better approach?
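
If the link is not managed by Arrow, I imagine the "new columns only" variant would look roughly like the following, with the pipeline itself carrying the association between the two files, either by row order or by a shared key and a join.  Again, all names and paths are made up:

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Leave the original file untouched; persist only the derived column(s)
    # plus a key column in a sibling file.
    original = pq.read_table("measurements.parquet")
    derived = pa.table({
        "sample_id": original["sample_id"],
        "normalized_signal": pc.multiply(original["raw_signal"], 2),
    })
    pq.write_table(derived, "measurements_derived.parquet")

    # Downstream: read both files and re-attach the derived column, relying on
    # identical row order; with no guaranteed order, join on sample_id instead.
    base = pq.read_table("measurements.parquet")
    extra = pq.read_table("measurements_derived.parquet")
    combined = base.append_column("normalized_signal", extra["normalized_signal"])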

Thanks,
Bill

William F. Smith
Bioinformatician
BCforward
Lilly Biotechnology Center
10290 Campus Point Dr.
San Diego, CA 92121
smith_william1@network.lilly.com
