Posted to dev@arrow.apache.org by Ying Zhou <yz...@gmail.com> on 2021/01/28 07:15:49 UTC

[C++] Random table generator and table converter

Hi,

For the C++ tests for the ORC writer there are two helpers that would significantly shorten the tests, namely a generic table generator and a table converter.

For the former I know there is arrow/testing/random.h, which can generate random arrays. Should I generate a random struct array using ArrayOf and then expand it into RecordBatches, or should I generate each column separately using ArrayOf and then combine them? By the way, I haven't found any function that can directly generate an Arrow Table from a schema, a size and a null_probability. Is there any need for such functionality? If this is useful for purposes beyond ORC/Parquet/CSV/etc IO, maybe we should write one.

For the latter, what I need is a table converter that can recursively convert every instance of LargeBinary and FixedSizeBinary into Binary, every LargeString into String, every Date64 into Timestamp (unit = MILLI), every LargeList and FixedSizeList into List, and maybe every Map into a List of Structs. This lets me independently produce the expected ORCReader(ORCWriter(Table)) so that I can verify that the ORCWriter is working as intended. For this problem I see at least two possible approaches: perform the conversion mainly at the array level, or mainly at the scalar level. Which one is better?

Thanks,
Ying

P.S. Thanks Antoine and Uwe for the very helpful reviews! The current codebase is already very different from what it was when it was last reviewed. :)
P.P.S. The table converter is unavoidable since Arrow has many more types than ORC.

Re: [C++] Random table generator and table converter

Posted by Antoine Pitrou <an...@python.org>.
Hi Ying,

On 28/01/2021 at 08:15, Ying Zhou wrote:
> By the way, I haven't found any function that can directly generate an Arrow Table from a schema, a size and a null_probability. Is there any need for such functionality? If this is useful for purposes beyond ORC/Parquet/CSV/etc IO, maybe we should write one.

Yes, that would probably be generally useful for testing.

> For the latter, what I need is a table converter that can recursively convert every instance of LargeBinary and FixedSizeBinary into Binary, every LargeString into String, every Date64 into Timestamp (unit = MILLI), every LargeList and FixedSizeList into List, and maybe every Map into a List of Structs. This lets me independently produce the expected ORCReader(ORCWriter(Table)) so that I can verify that the ORCWriter is working as intended. For this problem I see at least two possible approaches: perform the conversion mainly at the array level, or mainly at the scalar level.

Do you know the Cast() API? See arrow/compute/cast.h for details.

Regards

Antoine.