Posted to jira@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2020/10/02 11:46:00 UTC

[jira] [Updated] (ARROW-10159) [Rust][DataFusion] Add support for Dictionary types in data fusion

     [ https://issues.apache.org/jira/browse/ARROW-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Lamb updated ARROW-10159:
--------------------------------
    Description: 
We have a system that needs to process low-cardinality string data (that is, there are only a few distinct values, but many millions of values in total).

Using a `StringArray` is very expensive because the same string value is copied over and over again. The `DictionaryArray` was designed for exactly this situation: rather than repeating each string, it stores indexes into a dictionary and thus repeats only integer values.
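To make the layout idea concrete, here is a minimal, std-only sketch of dictionary encoding. It is not Arrow's `DictionaryArray` implementation; `DictColumn` and all of its methods are hypothetical names invented for illustration. It assumes `i32` keys and uses `None` to mark a null row.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified dictionary-encoded string column.
/// NOT Arrow's `DictionaryArray`; it only illustrates the layout idea.
#[derive(Debug, Default)]
pub struct DictColumn {
    /// Each distinct string is stored exactly once.
    values: Vec<String>,
    /// One integer key per row; `None` marks a null row.
    keys: Vec<Option<i32>>,
    /// Lookup from string to its key, used only while building.
    index: HashMap<String, i32>,
}

impl DictColumn {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append a value, reusing the existing key if the string was seen before.
    pub fn append(&mut self, value: &str) -> i32 {
        let existing = self.index.get(value).copied();
        let key = match existing {
            Some(k) => k,
            None => {
                // First time this string appears: store it and assign the next key.
                let k = self.values.len() as i32;
                self.values.push(value.to_string());
                self.index.insert(value.to_string(), k);
                k
            }
        };
        self.keys.push(Some(key));
        key
    }

    /// Append a null row; nothing is added to the dictionary.
    pub fn append_null(&mut self) {
        self.keys.push(None);
    }

    /// Resolve a row back to its string, or `None` for a null or out-of-range row.
    pub fn get(&self, row: usize) -> Option<&str> {
        self.keys
            .get(row)
            .copied()
            .flatten()
            .map(|k| self.values[k as usize].as_str())
    }

    /// Number of rows (including nulls).
    pub fn len(&self) -> usize {
        self.keys.len()
    }

    /// Number of distinct strings actually stored.
    pub fn distinct_values(&self) -> usize {
        self.values.len()
    }
}
```

In this sketch, appending "one" and "three" alternately for a thousand rows stores only two strings; every additional row costs one integer key rather than another copy of the string.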

Sadly, DataFusion does not currently support processing `DictionaryArray` types, for several reasons.

This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I would like to be possible:

{code}

#[tokio::test]
async fn query_on_string_dictionary() -> Result<()> {
    // ensure that data fusion can operate on dictionary types
    // Use StringDictionary (32 bit indexes = keys)
    let field_type = DataType::Dictionary(
        Box::new(DataType::Int32),
        Box::new(DataType::Utf8),
    );
    let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));


    let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
    let values_builder = StringBuilder::new(10);
    let mut builder = StringDictionaryBuilder::new(
        keys_builder, values_builder
    );

    builder.append("one")?;
    builder.append_null()?;
    builder.append("three")?;
    let array = Arc::new(builder.finish());

    let data = RecordBatch::try_new(
        schema.clone(),
        vec![array],
    )?;

    let table = MemTable::new(schema, vec![vec![data]])?;
    let mut ctx = ExecutionContext::new();
    ctx.register_table("test", Box::new(table));


    // Basic SELECT
    let sql = "SELECT * FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\nNULL\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // basic filtering
    let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // filtering with constant
    let sql = "SELECT * FROM test WHERE d1 = 'three'";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"three\"".to_string();
    assert_eq!(expected, actual);

    // Expression evaluation
    let sql = "SELECT concat(d1, '-foo') FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
    assert_eq!(expected, actual);

    // aggregation
    let sql = "SELECT COUNT(d1) FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "2".to_string();
    assert_eq!(expected, actual);


    Ok(())
}
{code}

However, it errors immediately:

{code}

---- query_on_string_dictionary stdout ----
thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == right)`
  left: `"\"one\"\nNULL\n\"three\""`,
 right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

{code}

This ticket tracks adding proper support for Dictionary types to DataFusion. I will break the work down into several smaller subtasks.

  was:
We have a system that needs to process low-cardinality string data (that is, there are only a few distinct values, but many millions of values in total).

Using a `StringArray` is very expensive because the same string value is copied over and over again. The `DictionaryArray` was designed for exactly this situation, where rather than repeating each string the data uses indexes into a dictionary and thus repeats only integer values.

Sadly, DataFusion does not support processing on `DictionaryArray` types for several reasons.

This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I would like to be possible:

{code}

#[tokio::test]
async fn query_on_string_dictionary() -> Result<()> {
    // ensure that data fusion can operate on dictionary types
    // Use StringDictionary (32 bit indexes = keys)
    let field_type = DataType::Dictionary(
        Box::new(DataType::Int32),
        Box::new(DataType::Utf8),
    );
    let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));


    let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
    let values_builder = StringBuilder::new(10);
    let mut builder = StringDictionaryBuilder::new(
        keys_builder, values_builder
    );

    builder.append("one")?;
    builder.append_null()?;
    builder.append("three")?;
    let array = Arc::new(builder.finish());

    let data = RecordBatch::try_new(
        schema.clone(),
        vec![array],
    )?;

    let table = MemTable::new(schema, vec![vec![data]])?;
    let mut ctx = ExecutionContext::new();
    ctx.register_table("test", Box::new(table));


    // Basic SELECT
    let sql = "SELECT * FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\nNULL\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // basic filtering
    let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one\"\n\"three\"".to_string();
    assert_eq!(expected, actual);

    // filtering with constant
    let sql = "SELECT * FROM test WHERE d1 = 'three'";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"three\"".to_string();
    assert_eq!(expected, actual);

    // Expression evaluation
    let sql = "SELECT concat(d1, '-foo') FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
    assert_eq!(expected, actual);

    // aggregation
    let sql = "SELECT COUNT(d1) FROM test";
    let actual = execute(&mut ctx, sql).await.join("\n");
    let expected = "2".to_string();
    assert_eq!(expected, actual);


    Ok(())
}
{code}

However, it errors immediately:

{code}

---- query_on_string_dictionary stdout ----
thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == right)`
  left: `"\"one\"\nNULL\n\"three\""`,
 right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

{code}

This ticket tracks adding proper support for Dictionary types to DataFusion. I will break the work down into several smaller subtasks.


> [Rust][DataFusion] Add support for Dictionary types in data fusion
> ------------------------------------------------------------------
>
>                 Key: ARROW-10159
>                 URL: https://issues.apache.org/jira/browse/ARROW-10159
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Andrew Lamb
>            Priority: Major
>
> We have a system that needs to process low-cardinality string data (that is, there are only a few distinct values, but many millions of values in total).
> Using a `StringArray` is very expensive because the same string value is copied over and over again. The `DictionaryArray` was designed for exactly this situation: rather than repeating each string, it stores indexes into a dictionary and thus repeats only integer values.
> Sadly, DataFusion does not support processing on `DictionaryArray` types for several reasons.
> This test (to be added to `arrow/rust/datafusion/tests/sql.rs`) shows what I would like to be possible:
> {code}
> #[tokio::test]
> async fn query_on_string_dictionary() -> Result<()> {
>     // ensure that data fusion can operate on dictionary types
>     // Use StringDictionary (32 bit indexes = keys)
>     let field_type = DataType::Dictionary(
>         Box::new(DataType::Int32),
>         Box::new(DataType::Utf8),
>     );
>     let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));
>     let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
>     let values_builder = StringBuilder::new(10);
>     let mut builder = StringDictionaryBuilder::new(
>         keys_builder, values_builder
>     );
>     builder.append("one")?;
>     builder.append_null()?;
>     builder.append("three")?;
>     let array = Arc::new(builder.finish());
>     let data = RecordBatch::try_new(
>         schema.clone(),
>         vec![array],
>     )?;
>     let table = MemTable::new(schema, vec![vec![data]])?;
>     let mut ctx = ExecutionContext::new();
>     ctx.register_table("test", Box::new(table));
>     // Basic SELECT
>     let sql = "SELECT * FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\nNULL\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // basic filtering
>     let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one\"\n\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // filtering with constant
>     let sql = "SELECT * FROM test WHERE d1 = 'three'";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"three\"".to_string();
>     assert_eq!(expected, actual);
>     // Expression evaluation
>     let sql = "SELECT concat(d1, '-foo') FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "\"one-foo\"\nNULL\n\"three-foo\"".to_string();
>     assert_eq!(expected, actual);
>     // aggregation
>     let sql = "SELECT COUNT(d1) FROM test";
>     let actual = execute(&mut ctx, sql).await.join("\n");
>     let expected = "2".to_string();
>     assert_eq!(expected, actual);
>     Ok(())
> }
> {code}
> However, it errors immediately:
> {code}
> ---- query_on_string_dictionary stdout ----
> thread 'query_on_string_dictionary' panicked at 'assertion failed: `(left == right)`
>   left: `"\"one\"\nNULL\n\"three\""`,
>  right: `"???\nNULL\n???"`', datafusion/tests/sql.rs:989:5
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
> {code}
> This ticket tracks adding proper support for Dictionary types to DataFusion. I will break the work down into several smaller subtasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)