
Posted to github@arrow.apache.org by "kesavkolla (via GitHub)" <gi...@apache.org> on 2023/02/16 20:02:25 UTC

[GitHub] [arrow-datafusion] kesavkolla opened a new issue, #3617: Feature request for support for struct and arry data types

kesavkolla opened a new issue, #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617

   DataFusion doesn't support all of the data types that Arrow supports. What is the roadmap for supporting structs, lists, etc.? It would also be good to support pushing some operations down to complex (nested) data in Arrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] kesavkolla commented on issue #3617: Feature request for support for struct and arry data types

Posted by GitBox <gi...@apache.org>.

kesavkolla commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1270090538

   I want to be able to specify some kind of path expression, like `select a.b.c`, for nested structs. For list types, I'd also like some notation to access an element by index.
   
   My data consists of heavily nested structs and lists. Currently I can't query their individual fields, and I can't use nested columns in filters.



[GitHub] [arrow-datafusion] alamb commented on issue #3617: Feature request for support for struct and arry data types

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1433631918

   Yeah, it seems to work just fine for me in datafusion-cli. Thus I think we should close this ticket in datafusion. I am not sure what is going on with ballista. 
   
   ```
   alamb@MacBook-Pro-8:~/Downloads$ datafusion-cli
   DataFusion CLI v18.0.0
   ❯ select text['status'] from 'part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet' limit 10;
   +----------------------------------------------------------------------------------+
   | part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet.text[status] |
   +----------------------------------------------------------------------------------+
   |                                                                                  |
   | generated                                                                        |
   | generated                                                                        |
   | generated                                                                        |
   | generated                                                                        |
   | generated                                                                        |
   | generated                                                                        |
   | generated                                                                        |
   | generated                                                                        |
   | generated                                                                        |
   +----------------------------------------------------------------------------------+
   10 rows in set. Query took 0.010 seconds.
   ```



[GitHub] [arrow-datafusion] alamb commented on issue #3617: Feature request for support for struct and arry data types

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1433657976

   Thanks @ahmedriza !



[GitHub] [arrow-datafusion] alamb commented on issue #3617: Feature request for support for struct and arry data types

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1261454756

   You may find more information on https://github.com/apache/arrow-datafusion/issues/2326 -- it would be great to get some idea of what you are trying to do / what datafusion can't do for you today



[GitHub] [arrow-datafusion] alamb commented on issue #3617: Feature request for support for struct and arry data types

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1431989592

   > If we take the Parquet provided by @kesavkolla, we have the following column, text whose Parquet schema is:
   
   Hi @ahmedriza  -- I am not sure what your system is doing exactly, but that error appears to be related to protobuf serialization
   
   I looked at `to_proto` and it seems like it has the right code
   https://github.com/apache/arrow-datafusion/blob/d05647c65e14d865b854a845ae970797a6086e2c/datafusion/proto/src/logical_plan/to_proto.rs#L859C30-L867
   
   Could you share the file you are using on this ticket so I can give it a try? Maybe we have fixed this in another version of DataFusion
   



[GitHub] [arrow-datafusion] alamb commented on issue #3617: Feature request for support for struct and arry data types

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1604486543

   https://github.com/apache/arrow-datafusion/issues/2326 is tracking such support



[GitHub] [arrow-datafusion] ahmedriza commented on issue #3617: Feature request for support for struct and arry data types

Posted by "ahmedriza (via GitHub)" <gi...@apache.org>.

ahmedriza commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1433656477

   I'll take a look to see if I can find out where `ballista` is going wrong, although, of course, ultimately, the call ends in `datafusion`. 



[GitHub] [arrow-datafusion] alamb commented on issue #3617: Feature request for support for struct and arry data types

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1433645989

   No, I take it back. What we should really do is probably start categorizing what works and what doesn't for these structured types. For example, access via `["foo"]` works, but list access does not.



[GitHub] [arrow-datafusion] alamb commented on issue #3617: Feature request for support for struct and arry data types

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1270418129

   > I want to be able to specify some kind of path expression like select a.b.c for nested structs. For list types also some notation to access the index.
   
   Have you tried the `[]` syntax?
   
   Like `struct_column["b"]["c"]` for nested structs and `list_column[3]` for list access? I was pleasantly surprised that it worked when I last tried it
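
   As a rough illustration of how a dotted path such as `a.b.c` maps onto the `[]` syntax, here is a minimal sketch. The helper `to_bracket_path` is hypothetical and not part of DataFusion. It assumes the first segment is the column name (so it cannot tell `table.column` apart from `column.field`), and it uses single-quoted keys, as in the `text['status']` queries elsewhere in this thread.

   ```rust
   /// Hypothetical helper (not part of DataFusion): rewrite a dotted path such
   /// as `a.b.c` into the bracket form `a['b']['c']`.
   fn to_bracket_path(dotted: &str) -> String {
       let mut parts = dotted.split('.');
       // The first segment is assumed to be the column name and is kept as-is.
       let mut out = parts.next().unwrap_or("").to_string();
       for field in parts {
           // Double any single quote so the key stays a valid SQL string literal.
           out.push_str(&format!("['{}']", field.replace('\'', "''")));
       }
       out
   }

   fn main() {
       let expr = to_bracket_path("struct_column.b.c");
       println!("select {} from t", expr);
   }
   ```

   Whether a given DataFusion version accepts the resulting expression still needs to be checked against that version.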



[GitHub] [arrow-datafusion] ahmedriza commented on issue #3617: Feature request for support for struct and arry data types

Posted by "ahmedriza (via GitHub)" <gi...@apache.org>.

ahmedriza commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1432204767

   @alamb Apologies, I should have been more clear.  The Parquet file mentioned was in https://github.com/apache/arrow-datafusion/issues/2439. Attaching here as well. 
   
   The above mentioned error was when I ran the SQL from `ballista` and I've checked that `ballista` on the master branch is currently using `datafusion` version `18.0.0`.  
   
   Hence, I just wrote two little tests, using the `datafusion` and the `ballista` context respectively. 
   
   SQL from the `datafusion` context works, whilst the one that uses the `ballista` context fails.   Test code:
   ```rust
   #[tokio::test]
   async fn test_datafusion_sql() {
       let ctx = SessionContext::new();
       let filename = "part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet";
       ctx.register_parquet("t", filename, ParquetReadOptions::default()).await.unwrap();
       let df = ctx.sql("select t.text['status'] from t").await.unwrap();
       df.show().await.unwrap();
   }
   ```
   Output:
   ```
   +----------------+
   | t.text[status] |
   +----------------+
   |                |
   | generated      |
   | generated      |
   | generated      |
   | generated      |
   | generated      |
   | generated      |
   | generated      |
   | generated      |
   | generated      |
   +----------------+
   ```
   ```rust
   #[tokio::test]
   async fn test_ballista_sql() {
       let config = BallistaConfig::builder().build().unwrap();
       let ctx = BallistaContext::standalone(&config, 10).await.unwrap();
       let filename = "part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet";
       ctx.register_parquet("t", filename, ParquetReadOptions::default()).await.unwrap();
       let df = ctx.sql("select t.text['status'] from t").await.unwrap();
       df.show().await.unwrap();
   }
   ```
   Output:
   ```
   thread 'query::test::test_ballista_sql' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError(ExternalError(Execution("Job QeRwZCh failed: Error planning job QeRwZCh: DataFusionError(Internal(\"physical_plan::to_proto() unsupported expression GetIndexedFieldExpr { arg: Column { name: \\\"text\\\", index: 0 }, key: Utf8(\\\"status\\\") }\"))")))', src/query.rs:44:25
   ```
   I am a bit surprised by the failure from the `ballista` version. 
   
   I've checked my `Cargo.toml` and the `cargo tree` output as well to double check that there really is just `datafusion` version `18.0.0` that's being used.
   
   ```
   ballista = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
   ballista-cli = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
   ballista-core = { git = "https://github.com/apache/arrow-ballista", features = ["s3"] }
   datafusion = "18.0.0"
   
   futures = "0.3"
   object_store = "0.5"
   tokio = { version = "1", features = ["full"] }
   ```



[GitHub] [arrow-datafusion] ahmedriza commented on issue #3617: Feature request for support for struct and arry data types

Posted by "ahmedriza (via GitHub)" <gi...@apache.org>.

ahmedriza commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1431227784

   If we take the Parquet provided by @kesavkolla, we have the following column, `text` whose Parquet schema is:
   ```
    |-- text: struct (nullable = true)
    |    |-- id: string (nullable = true)
    |    |-- extension: array (nullable = true)
    |    |    |-- element: string (containsNull = true)
    |    |-- status: string (nullable = true)
    |    |-- div: string (nullable = true)
   ```
   and sample data:
   ```
   +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | text                                                                                                                                                                                   |
   +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   |                                                                                                                                                                                        |
   | {"id": null, "extension": null, "status": "generated", ...
   ```
   I tried the following SQL to select one of the fields in the `struct`:
   ```
   ctx.register_parquet("t", "t.parquet", ParquetReadOptions::default()).await?;
   ctx.sql("select t.text['id'] from t").await?;
   ```
   
   However, this resulted in the following error:
   ```
   Error: Arrow error: External error: Execution error: Job zlH3pzz failed: Error planning job zlH3pzz: DataFusionError(Internal("physical_plan::to_proto() unsupported expression GetIndexedFieldExpr { arg: Column { name: \"text\", index: 0 }, key: Utf8(\"id\") }"))
   ```
   
   Looking at `datafusion/proto/src/physical_plan/to_proto.rs` it does appear that this is not supported at present.  Or perhaps I have made a mistake in my SQL?
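
   The error message reads like a serializer that falls through to a catch-all case for an expression it has no encoding for. A toy sketch of that pattern follows. The types and the wire format here are purely illustrative assumptions and are not DataFusion's; the real `to_proto.rs` may be structured differently.

   ```rust
   // Toy model only: these types are illustrative and are NOT DataFusion's.
   #[allow(dead_code)]
   #[derive(Debug)]
   enum PhysicalExpr {
       Column { name: String, index: usize },
       GetIndexedField { arg: Box<PhysicalExpr>, key: String },
   }

   // Serialize an expression to a pretend wire format. Variants without a
   // dedicated arm fall through to the catch-all and are rejected.
   fn to_proto(expr: &PhysicalExpr) -> Result<String, String> {
       match expr {
           PhysicalExpr::Column { name, index } => Ok(format!("column:{}:{}", name, index)),
           other => Err(format!("to_proto() unsupported expression {:?}", other)),
       }
   }

   fn main() {
       let expr = PhysicalExpr::GetIndexedField {
           arg: Box::new(PhysicalExpr::Column { name: "text".to_string(), index: 0 }),
           key: "id".to_string(),
       };
       match to_proto(&expr) {
           Ok(encoded) => println!("serialized: {}", encoded),
           Err(e) => println!("error: {}", e),
       }
   }
   ```

   In this model, supporting the expression means adding a `GetIndexedField` arm, plus a matching decoder on the receiving side.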



[GitHub] [arrow-datafusion] kesavkolla commented on issue #3617: Feature request for support for struct and arry data types

Posted by GitBox <gi...@apache.org>.

kesavkolla commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1270434094

   I tried `list_column[0]` on a list column, and it didn't work.
   
   I get the following exception:
   
   ```
   thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError(ComputeError("concat requires input of at least one array"))', /home/kesav/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/8df5496/datafusion/common/src/scalar.rs:1383:18
   ```



[GitHub] [arrow-datafusion] alamb closed issue #3617: Feature request for support for struct and arry data types

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #3617: Feature request for support for struct and arry data types
URL: https://github.com/apache/arrow-datafusion/issues/3617



[GitHub] [arrow-datafusion] ahmedriza commented on issue #3617: Feature request for support for struct and arry data types

Posted by "ahmedriza (via GitHub)" <gi...@apache.org>.

ahmedriza commented on issue #3617:
URL: https://github.com/apache/arrow-datafusion/issues/3617#issuecomment-1433915462

   After sorting out the `protobuf` serialisation of the `GetIndexedFieldExpr`, `ballista` now works. Tested the fixes on a branch at https://github.com/ahmedriza/arrow-datafusion/tree/get_indexed_proto.  Will raise a PR after checking it a bit more carefully.
   
   Now this code works as expected:
   ```rust
   async fn ballista_query() {
       let config = BallistaConfig::new().unwrap();
       let ctx = BallistaContext::standalone(&config, 10).await.unwrap();
       // The file name must be passed as a string literal.
       ctx.register_parquet("t", "part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet",
           ParquetReadOptions::default()).await.unwrap();
       let df = ctx.sql("select id, a[0], a[1], a[2], a[3], a[100] from t").await.unwrap();
       df.show().await.unwrap();
   }
   ```
   
   ```
   +----+--------+--------+--------+--------+----------+
   | id | t.a[0] | t.a[1] | t.a[2] | t.a[3] | t.a[100] |
   +----+--------+--------+--------+--------+----------+
   | 1  |        | 1.71   | 2.71   | 3.71   |          |
   +----+--------+--------+--------+--------+----------+
   ```

