Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/08 14:34:20 UTC

[GitHub] [arrow-datafusion] Cheappie opened a new issue, #2179: [Question] Are composite types supported ?

Cheappie opened a new issue, #2179:
URL: https://github.com/apache/arrow-datafusion/issues/2179

   Hi, I have seen in unit tests that it is possible to store Arrow data types in Parquet using ArrowWriter. I created a composite type like UUID to check whether a query against such data would work, but it fails. It looks as if the schema couldn't be read correctly, or simply isn't understood.
   
   This page mentions that nested types are not supported: https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html,
   but there is a way to serialize them via ArrowWriter.
   
   What is the current status?
   
   ```
   use std::path::Path;
   use std::sync::Arc;
   
   // NOTE: the import paths below assume the DataFusion / parquet crate layout
   // in use at the time of this report (DataFusion ~7.x); newer releases differ.
   use datafusion::arrow::array::{ArrayRef, Int64Array, StructArray};
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::datasource::file_format::parquet::{ParquetFormat, DEFAULT_PARQUET_EXTENSION};
   use datafusion::datasource::listing::ListingOptions;
   use datafusion::error::Result;
   use datafusion::prelude::ExecutionContext;
   use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
   
   const TABLE_ABS_PATH: &str = "/abc/table/...";
   const SAMPLE_PATH: &str = "/abc/table/sample1.parquet";
   
   #[tokio::main]
   async fn main() -> Result<()> {
       write();
       println!();
       read();
       println!();
       sql().await?;
   
       Ok(())
   }
   
   // Write a single RecordBatch containing one struct column ("UUID") to a Parquet file.
   fn write() {
       let file = std::fs::File::create(Path::new(SAMPLE_PATH)).unwrap();
   
       let uuid_structure = DataType::Struct(vec![
           Field::new("most", DataType::Int64, false),
           Field::new("least", DataType::Int64, false),
       ]);
   
       let uuid = Field::new("UUID", uuid_structure.clone(), false);
   
       let schema = Arc::new(Schema::new(vec![uuid]));
   
       let mut writer = ArrowWriter::try_new(file, schema.clone(), None).expect("...");
   
       let data = StructArray::from(vec![
           (
               Field::new("most", DataType::Int64, false),
               Arc::new(Int64Array::from(vec![1, 3])) as ArrayRef,
           ),
           (
               Field::new("least", DataType::Int64, false),
               Arc::new(Int64Array::from(vec![2, 4])) as ArrayRef,
           ),
       ]);
   
       let rd =
           datafusion::arrow::record_batch::RecordBatch::try_new(schema.clone(), vec![Arc::new(data)])
               .expect("...");
   
       writer.write(&rd).expect("...");
   
       writer.close().expect("...");
   }
   
   // Read the file back directly with the parquet crate's Arrow reader and print the batches.
   fn read() {
       use parquet::file::reader::{FileReader, SerializedFileReader};
       use std::{fs::File, path::Path};
   
       let f = File::open(Path::new(SAMPLE_PATH)).unwrap();
       let reader1 = SerializedFileReader::new(f).expect("...");
       let mut pqrd = ParquetFileArrowReader::new(Arc::new(reader1));
   
       let result = pqrd.get_record_reader(60).expect("...");
       for batch in result {
           let batch = batch.unwrap();
           println!("{:?}", batch);
       }
   }
   
   // Register the table directory with DataFusion and run a SQL query against it.
   async fn sql() -> Result<()> {
       let mut ctx = ExecutionContext::new();
   
       let file_format = ParquetFormat::default().with_enable_pruning(false);
       let listing_options = ListingOptions {
           file_extension: DEFAULT_PARQUET_EXTENSION.to_owned(),
           format: Arc::new(file_format),
           table_partition_cols: vec![],
           collect_stat: false,
           target_partitions: 1,
       };
   
       let uuid_structure = DataType::Struct(vec![
           Field::new("most", DataType::Int64, false),
           Field::new("least", DataType::Int64, false),
       ]);
   
       let uuid = Field::new("UUID", uuid_structure.clone(), false);
   
       let schema = Arc::new(Schema::new(vec![uuid]));
   
       ctx.register_listing_table(
           "FANCY_TABLE",
           &format!("file://{}", TABLE_ABS_PATH),
           listing_options,
           Some(schema),
       )
       .await
       .unwrap();
   
       let df = ctx.sql("SELECT * FROM FANCY_TABLE").await?;
   
       df.show().await?;
   
       Ok(())
   }
   ```
   
   Error:
   ```
   Invalid argument error: column types must match schema types, expected Struct([Field { name: "most", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "least", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }]) but found Struct([Field { name: "most", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }]) at column index 0
   ```
   
   Content:
   ```
   RecordBatch { schema: Schema { fields: [Field { name: "UUID", data_type: Struct([Field { name: "most", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "least", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [StructArray
   [
   -- child 0: "most" (Int64)
   PrimitiveArray<Int64>
   [
     1,
     3,
   ]
   -- child 1: "least" (Int64)
   PrimitiveArray<Int64>
   [
     2,
     4,
   ]
   ]] }
   ```
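   
   For context, the error text above is the generic column/schema type check performed by Arrow's `RecordBatch::try_new`. A minimal, standalone sketch (independent of Parquet and DataFusion, using the same Arrow types as the repro) that triggers the same class of mismatch:
   
   ```
   use std::sync::Arc;
   use datafusion::arrow::array::{ArrayRef, Int64Array, StructArray};
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::arrow::record_batch::RecordBatch;
   
   fn main() {
       // The schema declares a struct column with two children...
       let two_field_struct = DataType::Struct(vec![
           Field::new("most", DataType::Int64, false),
           Field::new("least", DataType::Int64, false),
       ]);
       let schema = Arc::new(Schema::new(vec![Field::new("UUID", two_field_struct, false)]));
   
       // ...but the column handed to try_new only carries one child, so the
       // batch is rejected with "column types must match schema types ...".
       let one_field_column: ArrayRef = Arc::new(StructArray::from(vec![(
           Field::new("most", DataType::Int64, false),
           Arc::new(Int64Array::from(vec![1, 3])) as ArrayRef,
       )]));
   
       let err = RecordBatch::try_new(schema, vec![one_field_column]).unwrap_err();
       println!("{}", err);
   }
   ```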




[GitHub] [arrow-datafusion] alamb commented on issue #2179: [Question] Are composite types supported ?

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2179:
URL: https://github.com/apache/arrow-datafusion/issues/2179#issuecomment-1107831171

   Thanks @Cheappie  -- I agree that the support in DataFusion is not complete. I have filed https://github.com/apache/arrow-datafusion/issues/2326 to start tracking what else is needed. Thanks again. 




[GitHub] [arrow-datafusion] Cheappie commented on issue #2179: [Question] Are composite types supported ?

Posted by GitBox <gi...@apache.org>.
Cheappie commented on issue #2179:
URL: https://github.com/apache/arrow-datafusion/issues/2179#issuecomment-1112063686

   Thanks @alamb for taking over that issue. I will keep my fingers crossed that this feature arrives in the near future.




[GitHub] [arrow-datafusion] Cheappie commented on issue #2179: [Question] Are composite types supported ?

Posted by GitBox <gi...@apache.org>.
Cheappie commented on issue #2179:
URL: https://github.com/apache/arrow-datafusion/issues/2179#issuecomment-1103159840

   @alamb well, in my case it is impossible to run the query at all, because DataFusion (via the ArrowReader) fails to load a Parquet file that was serialized with ArrowWriter.
   
   From what we can see in the error below, the schema is somehow missing a field. The ArrowReader actually reads the schema correctly; a bit later one field gets lost, or perhaps the struct is interpreted incorrectly somewhere in DataFusion.
   ```
   expected: Struct([Field { name: "most" }, Field { name: "least" }])
   but found: Struct([Field { name: "most" }]) at column index 0
   ```
   
   What's even more interesting: using the same ArrowReader (ParquetFileArrowReader) that DataFusion uses internally, I was able to read this Parquet file without issues and access both columns of the struct with the snippet below.
   ```
       // `file` is the Parquet file written earlier (see SAMPLE_PATH above).
       let rd = SerializedFileReader::new(file).expect("...");
       let mut pqrd = ParquetFileArrowReader::new(Arc::new(rd));
   
       for batch in pqrd.get_record_reader(60).expect("...") {
           let batch = batch.unwrap();
           // Column 0 is the "UUID" struct; its child data holds "most" and "least".
           let col = batch.column(0);
           let child_data = col.data().child_data();
           println!("{:?}", child_data[0].buffers().get(0));
           println!("{:?}", child_data[1].buffers().get(0));
       }
   ```
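   
   As a side note, the same reader can also report the schema it reconstructs from the file metadata (a sketch assuming the `get_schema` method on the parquet crate's `ArrowReader` trait available at the time); both struct children show up there, which is what makes the later mismatch so surprising:
   ```
       // Print the Arrow schema the parquet reader infers from the file;
       // it should list both "most" and "least" under the "UUID" struct.
       let file = File::open(Path::new(SAMPLE_PATH)).unwrap();
       let reader = SerializedFileReader::new(file).expect("...");
       let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));
       println!("{:#?}", arrow_reader.get_schema().expect("..."));
   ```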




[GitHub] [arrow-datafusion] alamb commented on issue #2179: [Question] Are composite types supported ?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #2179:
URL: https://github.com/apache/arrow-datafusion/issues/2179#issuecomment-1520321360

   Tracking in https://github.com/apache/arrow-datafusion/issues/2326




[GitHub] [arrow-datafusion] Cheappie commented on issue #2179: [Question] Are composite types supported ?

Posted by GitBox <gi...@apache.org>.
Cheappie commented on issue #2179:
URL: https://github.com/apache/arrow-datafusion/issues/2179#issuecomment-1100189949

   It would be great to receive some feedback. I am not sure, but the described issue might be a bug somewhere in the Arrow reader.




[GitHub] [arrow-datafusion] alamb commented on issue #2179: [Question] Are composite types supported ?

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2179:
URL: https://github.com/apache/arrow-datafusion/issues/2179#issuecomment-1100858104

   Sorry @Cheappie -- I think the answer to your question is that this feature is not ready yet. We have some support, such as https://github.com/apache/arrow-datafusion/issues/119, but perhaps more is needed as well.
   
   I wonder if you have tried the following formulations:
   
   ```sql
   SELECT UUID.most  FROM FANCY_TABLE
   ```
   
   ```sql
   SELECT UUID["most"]  FROM FANCY_TABLE
   ```
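   
   For completeness, a sketch of how those formulations could be tried from the `sql()` function in the repro above (whether either syntax resolves the struct field depends on the DataFusion version in use):
   
   ```rust
       // Project the struct's children instead of SELECT *.
       let df = ctx.sql("SELECT UUID.most FROM FANCY_TABLE").await?;
       df.show().await?;
   
       let df = ctx.sql(r#"SELECT UUID["most"] FROM FANCY_TABLE"#).await?;
       df.show().await?;
   ```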
   
   




[GitHub] [arrow-datafusion] alamb closed issue #2179: [Question] Are composite types supported ?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #2179: [Question] Are composite types supported ?
URL: https://github.com/apache/arrow-datafusion/issues/2179

