Posted to github@arrow.apache.org by "krishna-prasad-s (via GitHub)" <gi...@apache.org> on 2023/09/15 02:09:00 UTC

[GitHub] [arrow-datafusion] krishna-prasad-s opened a new issue, #7564: Mismatch in timestamp format between dataframe and recordbatch.

krishna-prasad-s opened a new issue, #7564:
URL: https://github.com/apache/arrow-datafusion/issues/7564

   ### Describe the bug
   
   I'm trying to read a pre-existing Delta table (using delta-rs as the `TableProvider`).
   
   When I call `self.ctx.read_table(table).unwrap();` I get a dataframe in which a timestamp field has `data_type: Timestamp(Microsecond, None)`.
   
   I then collect record batches with `df.collect().await;`. When I inspect the schema of a record batch, the data type of that field has changed to `data_type: Timestamp(Nanosecond, None)`.
   
   When I use the DeltaTable writer (from delta-rs), it diffs the Arrow schemas of the table and the record batch and reports a mismatch. The timestamp unit is the only difference, and the write fails.
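   
   A minimal sketch of the kind of schema comparison described above, assuming the writer compares fields by name and data type. The `TimeUnit`, `DataType`, and `Field` types here are simplified stand-ins written for illustration, not the real Arrow types, and `schema_diff` and `field` are hypothetical helpers rather than delta-rs functions.
   
   ```rust
   // Simplified stand-ins for Arrow's types (assumption: illustration only).
   #[allow(dead_code)]
   #[derive(Debug, Clone, PartialEq)]
   enum TimeUnit {
       Microsecond,
       Nanosecond,
   }
   
   #[allow(dead_code)]
   #[derive(Debug, Clone, PartialEq)]
   enum DataType {
       Utf8,
       Timestamp(TimeUnit, Option<String>),
   }
   
   #[derive(Debug, Clone, PartialEq)]
   struct Field {
       name: String,
       data_type: DataType,
   }
   
   /// Hypothetical helper that builds a field from a name and a type.
   #[allow(dead_code)]
   fn field(name: &str, data_type: DataType) -> Field {
       Field { name: name.to_string(), data_type }
   }
   
   /// Returns one message per table field that is missing from the batch
   /// schema or whose data type differs from it.
   #[allow(dead_code)]
   fn schema_diff(table: &[Field], batch: &[Field]) -> Vec<String> {
       let mut mismatches = Vec::new();
       for expected in table {
           match batch.iter().find(|f| f.name == expected.name) {
               Some(actual) if actual.data_type != expected.data_type => {
                   mismatches.push(format!(
                       "{}: table has {:?}, batch has {:?}",
                       expected.name, expected.data_type, actual.data_type
                   ));
               }
               Some(_) => {}
               None => mismatches.push(format!("{}: missing from batch", expected.name)),
           }
       }
       mismatches
   }
   ```
   
   Under these assumptions, a table field of `Timestamp(Microsecond, None)` paired with a batch field of `Timestamp(Nanosecond, None)` yields exactly one mismatch message, which matches the failure described here.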
   
   
   ### To Reproduce
   
   1. Create a Delta table (not using DataFusion)
   2. Open it using DataFusion and collect the dataframe
   3. Compare the schema between the Delta table and the record batch
   
   ### Expected behavior
   
   The recordbatch schema should be `data_type: Timestamp(Microsecond, None)`.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #7564: Mismatch in timestamp format between dataframe and recordbatch.

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7564:
URL: https://github.com/apache/arrow-datafusion/issues/7564#issuecomment-1722213522

   Can you provide an example of how you do
   
   > open it using DataFusion and collect the dataframe
   
   I wonder if this issue is in datafusion or something in delta.rs 🤔 




[GitHub] [arrow-datafusion] alamb commented on issue #7564: Mismatch in timestamp format between dataframe and recordbatch when read from a delta table

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7564:
URL: https://github.com/apache/arrow-datafusion/issues/7564#issuecomment-1728375000

   I looked around in the delta.rs code, and while I am not super familiar with how DeltaLake works, it seems like the code in delta.rs treats all timestamps as though they have microsecond precision:
   
   https://github.com/delta-io/delta-rs/blob/a74589be7c39315360925049c716d1d70b906970/rust/src/delta_arrow.rs#L122-L125
   
   Perhaps this issue is related:  https://github.com/delta-io/delta/issues/643
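   
   As a rough illustration of that behaviour (a sketch, not the actual delta-rs code): assuming a Delta primitive type name is mapped to an Arrow-like type by a simple lookup, the `timestamp` type always comes out with microsecond precision. `ArrowType`, `TimeUnit`, and `delta_primitive_to_arrow` are hypothetical names used only for this example, and only a few primitive types are covered.
   
   ```rust
   // Simplified stand-ins for Arrow's types (assumption: illustration only).
   #[allow(dead_code)]
   #[derive(Debug, Clone, PartialEq)]
   enum TimeUnit {
       Microsecond,
       Nanosecond,
   }
   
   #[allow(dead_code)]
   #[derive(Debug, Clone, PartialEq)]
   enum ArrowType {
       Utf8,
       Int64,
       Date32,
       Timestamp(TimeUnit),
   }
   
   /// Hypothetical lookup from a Delta primitive type name to an Arrow-like
   /// type. Every `timestamp` maps to microsecond precision; names this
   /// sketch does not cover return `None`.
   #[allow(dead_code)]
   fn delta_primitive_to_arrow(name: &str) -> Option<ArrowType> {
       match name {
           "string" => Some(ArrowType::Utf8),
           "long" => Some(ArrowType::Int64),
           "date" => Some(ArrowType::Date32),
           "timestamp" => Some(ArrowType::Timestamp(TimeUnit::Microsecond)),
           _ => None,
       }
   }
   ```
   
   With a mapping like this, a table schema derived from Delta metadata can never report nanosecond precision, so any nanosecond column in a collected batch would necessarily disagree with it.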
   
   > let record_batches: Vec<RecordBatch> = cast_df.collect().await.unwrap(); this works but when I see the schema in record batch , I see this now as data_type: Timestamp(Nanosecond, None).
   
   What is the definition of `cast_df`? 
   
   I see `dataframe` defined above, but not `cast_df`:
   ```
    let dataframe = self.ctx.read_table(table).unwrap();
   ```




[GitHub] [arrow-datafusion] krishna-prasad-s commented on issue #7564: Mismatch in timestamp format between dataframe and recordbatch when read from a delta table

Posted by "krishna-prasad-s (via GitHub)" <gi...@apache.org>.
krishna-prasad-s commented on issue #7564:
URL: https://github.com/apache/arrow-datafusion/issues/7564#issuecomment-1723193554

   Sure,
   First I open a Delta table (this is from delta-rs with DataFusion enabled):
   `let delta_table = delta::open_table_with_storage_options(table_path, storage_options).await;`
   `table_path` is on ADLS (`abfss://container@az21p1sewe01.dfs.core.windows.net/im/lev/table/`) and
   `storage_options` holds the data for access.
   
   Once I have this table, I try to use its schema to create a local table.
   The schema is retrieved as
   
   ```
    let schema = table.get_schema().unwrap().clone();
   
   ```
   
   and another table is created as
   
   ```
           let response = DeltaTableBuilder::from_uri(table_path)
               .with_storage_options(storage_options.unwrap_or_default())
               .build();
   
           match response {
               Ok(table) => {
                   let mut config = DeltaOps(table)
                       .create()
                       .with_columns(schema.get_fields().clone()); 
   
                   if let Some(name) = name {
                       config = config.with_table_name(name);
                   }               
                   
                   let built_table = config.into_future().await?;
   
                   return Ok(built_table);
               },
               Err(err) => return Err(err),
           }  
   
   ```
   
   I now try to get the data frame from the original table
   
   ```
    let dataframe = self.ctx.read_table(table).unwrap();
   ```
   Here, when I inspect the schema, it is still returned as `data_type: Timestamp(Microsecond, None)`.
   
   Now I tried to get record batches with `let record_batches: Vec<RecordBatch> = cast_df.collect().await.unwrap();`. This works, but when I look at the schema in a record batch, I see this now as `data_type: Timestamp(Nanosecond, None)`.
   
   I even attempted a cast when selecting from the dataframe, with something like `let cast_expr = c.cast_to(&DataType::Timestamp(TimeUnit::Microsecond, None), data_schema).unwrap();` in the select statement, but the result is still the same.




[GitHub] [arrow-datafusion] krishna-prasad-s commented on issue #7564: Mismatch in timestamp format between dataframe and recordbatch when read from a delta table

Posted by "krishna-prasad-s (via GitHub)" <gi...@apache.org>.
krishna-prasad-s commented on issue #7564:
URL: https://github.com/apache/arrow-datafusion/issues/7564#issuecomment-1729079889

   Hi,
   I tried casting in `df.select`:
   `let mut cast_df = df.select(fix_timestamp_to_micro(&sel_schema)).unwrap();`
   with the function defined as
   ```
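   // Assumed imports; exact paths may differ between DataFusion versions.
   use datafusion::arrow::datatypes::{DataType, TimeUnit};
   use datafusion::common::DFSchema;
   use datafusion::logical_expr::{col, Expr, ExprSchemable};
   
   /// Builds a projection that casts every timezone-less nanosecond or
   /// microsecond timestamp column to microsecond precision and passes all
   /// other columns through unchanged.
   fn fix_timestamp_to_micro(data_schema: &DFSchema) -> Vec<Expr> {
       let micros = DataType::Timestamp(TimeUnit::Microsecond, None);
       let nanos = DataType::Timestamp(TimeUnit::Nanosecond, None);
       let mut exprs: Vec<Expr> = Vec::new();
       for column in data_schema.fields() {
           if column.data_type() == &nanos || column.data_type() == &micros {
               // `cast_to` comes from the `ExprSchemable` trait.
               let cast_expr = col(column.name()).cast_to(&micros, data_schema).unwrap();
               exprs.push(cast_expr);
           } else {
               exprs.push(col(column.name()));
           }
       }
       exprs
   }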
   ```
   It didn't work. I also tried placing a breakpoint in delta-rs to see whether DataFusion hits any delta-rs function during `collect`. I could not find any.
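   
   For reference, the value-level conversion that such a cast implies can be sketched in plain Rust. This is only an illustration on raw `i64` values with hypothetical helper names; with real record batches one would more likely apply Arrow's `cast` compute kernel to each affected column after `collect`. Rounding toward negative infinity is a choice made for this sketch, and Arrow's own cast may round differently.
   
   ```rust
   /// Converts nanoseconds since the Unix epoch to microseconds, rounding
   /// toward negative infinity so pre-epoch values stay consistent.
   #[allow(dead_code)]
   fn nanos_to_micros(nanos: i64) -> i64 {
       nanos.div_euclid(1_000)
   }
   
   /// Converts a nullable column of nanosecond timestamps to microseconds,
   /// keeping nulls (`None`) in place.
   #[allow(dead_code)]
   fn column_nanos_to_micros(values: &[Option<i64>]) -> Vec<Option<i64>> {
       values.iter().map(|v| v.map(nanos_to_micros)).collect()
   }
   ```
   
   The conversion drops sub-microsecond digits, so it loses information whenever the nanosecond values are not multiples of 1,000.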

