Posted to github@arrow.apache.org by "tv42 (via GitHub)" <gi...@apache.org> on 2023/12/09 02:27:42 UTC

[I] DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached [arrow-datafusion]

tv42 opened a new issue, #8476:
URL: https://github.com/apache/arrow-datafusion/issues/8476

   ### Describe the bug
   
   `DataFrame::cache` returns an error for a query that succeeds when it is executed without caching the results first.
   
   I would have expected caching to have no effect on success/failure.
   
   ### To Reproduce
   
   ```rust
   use datafusion::prelude::SessionContext;
   
   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {
       let sql = "SELECT CASE WHEN true THEN NULL ELSE 1 END;";
       let ctx = SessionContext::new();
       let plan = ctx.state().create_logical_plan(sql).await?;
       let df = ctx.execute_logical_plan(plan).await?;
       // Comment out the next line to make the error go away.
       let df = df.cache().await?;
       let batches = df.collect().await?;
       let display = datafusion::arrow::util::pretty::pretty_format_batches(&batches).unwrap();
       println!("{}", display);
       Ok(())
   }
   ```
   
   ### Expected behavior
   
   Behavior with and without `let df = df.cache().await?` should be functionally the same; caching should only change performance and memory use.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #8476:
URL: https://github.com/apache/arrow-datafusion/issues/8476#issuecomment-1850197001

   In general, `DataType::Null` is resolved (via coercion) to an actual type from the target schema.
   
   Maybe we could apply the same approach here.
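   
   As a rough sketch of what that resolution could look like: wherever a column is typed `Null`, take the type the target schema expects for it. This is only a hypothetical model for illustration; `DataType` and `resolve_nulls` below are made-up stand-ins, not DataFusion's actual coercion code.
   
   ```rust
   // Simplified, hypothetical stand-in for Arrow's DataType.
   #[derive(Debug, Clone, PartialEq)]
   enum DataType {
       Null,
       Int64,
       Utf8,
   }
   
   // For each column, replace a Null type with the type the target schema
   // declares at the same position; all other types are kept unchanged.
   fn resolve_nulls(actual: &[DataType], target: &[DataType]) -> Vec<DataType> {
       actual
           .iter()
           .zip(target)
           .map(|(a, t)| if *a == DataType::Null { t.clone() } else { a.clone() })
           .collect()
   }
   ```
   
   Under this model, a `Null` column checked against an `Int64` target resolves to `Int64`, so a later schema comparison would see matching types.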



Re: [I] DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #8476: DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached
URL: https://github.com/apache/arrow-datafusion/issues/8476



Re: [I] DataFrame::cache errors with `Plan("Mismatch between schema and batches")` but query works when not cached [arrow-datafusion]

Posted by "Asura7969 (via GitHub)" <gi...@apache.org>.

Asura7969 commented on issue #8476:
URL: https://github.com/apache/arrow-datafusion/issues/8476#issuecomment-1849431378

   The cause is this schema comparison:
   https://github.com/apache/arrow-datafusion/blob/d091b55be6a4ce552023ef162b5d081136d3ff6d/datafusion/core/src/datasource/memory.rs#L68
   
   schema:
   ```
   Field { name: "CASE WHEN Boolean(true) THEN NULL ELSE Int64(1) END", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
   ```
   
   batches_schema:
   ```
   Field { name: "CASE WHEN Boolean(true) THEN NULL ELSE Int64(1) END", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
   ```
   The `data_type` values differ: `Null` vs. `Int64`.
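   
   To make the failure concrete, here is a minimal sketch of that kind of comparison. It is only a model: `DataType`, `Field`, `schema_accepts`, and `case_field` are simplified, hypothetical stand-ins written for illustration, not the real Arrow or DataFusion types, and the actual check may compare more than this (metadata, for example).
   
   ```rust
   // Simplified, hypothetical stand-ins for Arrow's DataType and Field.
   #[derive(Debug, Clone, PartialEq)]
   enum DataType {
       Null,
       Int64,
   }
   
   #[derive(Debug, Clone, PartialEq)]
   struct Field {
       name: String,
       data_type: DataType,
       nullable: bool,
   }
   
   // Models the compatibility check: each declared field must match the
   // batch field at the same position in name and data type, and a
   // non-nullable declared field cannot accept a nullable batch field.
   fn schema_accepts(declared: &[Field], batch: &[Field]) -> bool {
       declared.len() == batch.len()
           && declared.iter().zip(batch).all(|(d, b)| {
               d.name == b.name && d.data_type == b.data_type && (d.nullable || !b.nullable)
           })
   }
   
   // Builds the single nullable column from the report with the given type.
   fn case_field(data_type: DataType) -> Field {
       Field {
           name: "CASE WHEN Boolean(true) THEN NULL ELSE Int64(1) END".to_string(),
           data_type,
           nullable: true,
       }
   }
   ```
   
   With these definitions, `schema_accepts(&[case_field(DataType::Null)], &[case_field(DataType::Int64)])` is `false`, which corresponds to the "Mismatch between schema and batches" error, while comparing the `Int64` schema with itself gives `true`.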
   
   Could we compare against the optimized schema (the physical plan's schema) instead? For example:
   Before:
   ```rust
       pub async fn cache(self) -> Result<DataFrame> {
           let context = SessionContext::new_with_state(self.session_state.clone());
           let mem_table = MemTable::try_new(
               SchemaRef::from(self.schema().clone()),
               self.collect_partitioned().await?,
           )?;
   
           context.read_table(Arc::new(mem_table))
       }
   ```
   After:
   ```rust
       pub async fn cache(self) -> Result<DataFrame> {
           let context = SessionContext::new_with_state(self.session_state.clone());
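           // Clone first: `create_physical_plan` takes `self` by value, and
           // `self` is still needed for `collect_partitioned` below.
           let physical_plan = self.clone().create_physical_plan().await?;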
           let mem_table = MemTable::try_new(
               physical_plan.schema(),
               self.collect_partitioned().await?,
           )?;
   
           context.read_table(Arc::new(mem_table))
       }
   ```
   

