Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2021/02/23 15:49:00 UTC

[jira] [Updated] (ARROW-11745) [C++] Improve configurability of random data generation

     [ https://issues.apache.org/jira/browse/ARROW-11745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Kietzman updated ARROW-11745:
---------------------------------
    Description: 
{{arrow::random::RandomArrayGenerator}} is useful for stress testing and benchmarking. Arrays of primitives can be generated with little boilerplate; however, it is cumbersome to specify the creation of nested arrays or record batches, which are necessary for testing n-column operations such as group_by.

My ideal API for random generation takes only a FieldVector, a number of rows, and a seed as arguments. Other options (such as minimum, maximum, unique count, null probability, etc.) are specified using field metadata so that they can be provided uniformly or granularly as necessary for a given test case:
{code:c++}
auto random_batch = Generate({
  field("i32", int32()), // i32 may take any value between INT_MIN and INT_MAX
                         // and will be null with default probability 0.01
  field("f32", float32(), false), // f32 is non-nullable, so it will be entirely valid
  field("probability", float64(), true, key_value_metadata({
    // custom random generation properties:
    {"min", "0.0"},
    {"max", "1.0"},
    {"null_probability", "0.0001"},
  })),
  field("list_i32", list(
    field("item", int32(), true, key_value_metadata({
      // custom random generation properties can also be specified for nested fields:
      {"min", "0"},
      {"max", "1"},
    }))
  )),
}, num_rows, 0xdeadbeef);
{code}
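
As a rough illustration of how such metadata-driven options might be consumed, here is a minimal standalone sketch. It does not use Arrow at all: Metadata, GenOptions, ParseOptions, Column and GenerateDoubles are hypothetical names invented for this example, and the 0.0 to 1.0 default range is an assumption of the sketch rather than part of the proposal. Only the key names ("min", "max", "null_probability") and the default null probability of 0.01 are taken from the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the string key/value metadata attached to a field.
using Metadata = std::unordered_map<std::string, std::string>;

// Options understood by this sketch. The default range is an assumption.
struct GenOptions {
  double min = 0.0;
  double max = 1.0;
  double null_probability = 0.01;  // default quoted in the proposal
};

// Look up a numeric option, falling back to a default when the key is absent.
inline double GetOr(const Metadata& md, const std::string& key, double fallback) {
  auto it = md.find(key);
  return it == md.end() ? fallback : std::stod(it->second);
}

inline GenOptions ParseOptions(const Metadata& md) {
  GenOptions opts;
  opts.min = GetOr(md, "min", opts.min);
  opts.max = GetOr(md, "max", opts.max);
  opts.null_probability = GetOr(md, "null_probability", opts.null_probability);
  return opts;
}

// A generated column: values plus validity flags (false marks a null slot).
struct Column {
  std::vector<double> values;
  std::vector<bool> valid;
};

// Generate num_rows doubles in [min, max), deterministically for a given seed.
inline Column GenerateDoubles(const Metadata& md, std::size_t num_rows,
                              std::uint64_t seed) {
  const GenOptions opts = ParseOptions(md);
  assert(opts.min < opts.max);
  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<double> value_dist(opts.min, opts.max);
  std::bernoulli_distribution null_dist(opts.null_probability);

  Column col;
  col.values.reserve(num_rows);
  col.valid.reserve(num_rows);
  for (std::size_t i = 0; i < num_rows; ++i) {
    col.valid.push_back(!null_dist(rng));
    col.values.push_back(value_dist(rng));
  }
  return col;
}
```

Under these assumptions, a call such as GenerateDoubles({{"min", "0.0"}, {"max", "1.0"}, {"null_probability", "0.0001"}}, num_rows, 0xdeadbeef) would play the role of the "probability" field above. Keeping the options as strings in a map is what lets the same lookup apply uniformly to top-level and nested fields.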

  was:
{{arrow::random::RandomArrayGenerator}} is useful for stress testing and benchmarking. Arrays of primitives can be generated with little boilerplate; however, it is cumbersome to specify the creation of nested arrays or record batches, which are necessary for testing n-column operations such as group_by.

My ideal API for random generation takes only a FieldVector, a number of rows, and a seed as arguments. Other options (such as minimum, maximum, unique count, null probability, etc.) are specified using field metadata so that they can be provided uniformly or granularly as necessary for a given test case:
{code:c++}
auto random_batch = Generate({
  field("i32", int32()), // i32 may take any value between INT_MIN and INT_MAX
                         // and will be null with default probability 0.01
  field("f32", float32(), false), // f32 is non-nullable, so it will be entirely valid
  field("probability", float64(), true, key_value_metadata({
    // custom random generation properties:
    {"min", "0.0"},
    {"max", "1.0"},
    {"null_probability", "0.0001"},
  })),
  field("list_i32", list(
    field("item", int32(), true, key_value_metadata({
      // custom random generation properties can also be specified for null fields:
      {"min", "0"},
      {"max", "1"},
    }))
  )),
}, num_rows, 0xdeadbeef);
{code}


> [C++] Improve configurability of random data generation
> -------------------------------------------------------
>
>                 Key: ARROW-11745
>                 URL: https://issues.apache.org/jira/browse/ARROW-11745
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 3.0.0
>            Reporter: Ben Kietzman
>            Assignee: Ben Kietzman
>            Priority: Major


