You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "zeodtr (via GitHub)" <gi...@apache.org> on 2023/02/02 09:35:29 UTC

[GitHub] [arrow-datafusion] zeodtr opened a new issue, #5157: Optimizer: Avoid too many string cloning in the optimizer

zeodtr opened a new issue, #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   I'm not sure this is a feature request, but at least this is not a bug (albeit it's a performance problem), so I write this issue as a feature request.
   
   I'm benchmarking the optimizers of DataFusion and Calcite.  
   I intended to compare the quality of the optimized plans between them, assuming that DataFusion's optimizing speed would be faster (since it's written in Rust).  
   But to my surprise, I found that Calcite's optimizer is way faster (~ 20x) in some cases.
   
   The case is as follows:
   
   * Query statement is very simple: "select column_1 from table_1"
   * table_1 has about 700 columns (Yes, it has so many columns, that's the problem).
   
   While Calcite finished the optimization in about 7 msec, DataFusion's optimizer took about 120 msec.  
   At first, the number was worse, but it settled to about 120 msec when I set the global allocator to mimalloc. (I've tried snmalloc and it was somewhat faster - about 100 msec. But somehow snmalloc did not play well with valgrind, I chose mimalloc at least temporarily)
   
   I ran the test program with valgrind / callgrind and drew the call graph.  The graph showed that about half of the execution time is being spent on `<alloc::string::String as core::clone::Clone>::clone`. The call count was 3,930,814.
   
   I ran the optimizer for another table with fewer columns (about 200 columns), and it took much less time - about 12msec.   
   
   So, I suspect that the optimizer becomes slow (at least for a table with many columns) because it clones the strings related to the schema of the table too many times.
   
   **Describe the solution you'd like**
   
   Perhaps removing unnecessary cloning may help. Or, make the fields immutable and manage them with reference counted smart pointers.
   
   **Describe alternatives you've considered**
   
   No alternatives.
   
   **Additional context**
   
   None.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mslapek commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "mslapek (via GitHub)" <gi...@apache.org>.
mslapek commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1475276695

   **TBH** String interning would be the best ⭐️ - no cloning, quick `Eq`/`Hash`...
   
   Instead of `String` we could use `StringId`:
   
   ```rust
   #[derive(Hash, PartialEq, Eq, Copy, Clone)]
   struct StringId(usize);
   ```
   
   And some `HashMap<String, StringId>` to assign strings to unique IDs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1484168398

   https://github.com/apache/arrow-rs/issues/3955 may also be relevant here, I'm optimistic that we can drastically reduce the overheads without needing to reach for exotic solutions like string interning


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #5157: Optimizer: Avoid too many string cloning in the optimizer

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1414055056

   Possibly related - https://github.com/apache/arrow-datafusion/issues/4680


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1476734081

   > What do you think about 🎯 precomputed hash + Arc<str>?
   
   Seems reasonable to me. Maybe as some intermediate state we could have a `StringInterner` structure that computed the hash + `Arc<str>` and tried to reuse any previously known about. 
   
   Then we could thread through a StringInterner when possible (e.g. on the optimizer or the sql planner) but also be able to make these strings easily without one (so we could incrementally update the code)
   
   Any such solution, I think, should strive *very hard* to keep the burden of working in the DataFusion codebase low (e.g. should be easy to work with for anyone used to String, be well documented, etc)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5157: Optimizer: Avoid too many string cloning in the optimizer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1414339134

   I looked at the trace and here are my observations:
   
   As @tustvold  has said, if we can have `DFSchema` / `DFField`  that don't copy the values https://github.com/apache/arrow-datafusion/issues/4680 around that would help immensly
   
   A large amount of the allocations come from  `DFSchema::merge` -- see https://github.com/apache/arrow-datafusion/blob/224c682101949da57aebc36e92e5a881ef3040d4/datafusion/common/src/dfschema.rs#L135-L151
   
   ![Screenshot 2023-02-02 at 3 26 45 PM](https://user-images.githubusercontent.com/490673/216442910-599d96a1-7ffe-450c-b955-66e731c7aaf5.png)
   
   And a large part of that is how it ignores errors with `.ok()` where were quite expensive to produce
   
   It also appears there is copying going on in unwrap_cast_in_comparison and common subexpr eliminiate
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1426833574

   Proposed adding in https://github.com/apache/arrow-datafusion/pull/5256


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1476051767

   > TBH String interning would be the best ⭐️ - no cloning, quick Eq/Hash...
   
   Agreed though then we have to thread the HashMap through
   
   Another possibility would be to use `Arc<str>` (aka refcounted strings)  -- not quite as cheap to clone / create but also don't need any external context
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mslapek commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "mslapek (via GitHub)" <gi...@apache.org>.
mslapek commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1476661030

   > Another possibility would be to use `Arc<str>` (aka refcounted strings) -- not quite as cheap to clone / create but also don't need any external context
   
   Looks like a good tradeoff.
   
   What do you think about 🎯 **precomputed hash** +  `Arc<str>`?
   
   * constant time cloning
   * constant time hashing (#5623 proposes to use `LogicalPlan` hashing in the optimizer loop)
   
   Performance almost-like string interning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Optimizer is slow: Avoid too many string cloning in the optimizer [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1919475055

   > @alamb I'm wondering if #9020 closes this one
   
   It is a good call @comphead  -- It certainly helps 😅. 
   
   > Speeds up benchmarks in sql_planner.rs by 18-75%. 
   
   (BTW these are the benchmarks that @zeodtr contributed ❤️ )
   
   I am also quite excited about @matthewmturner 's work in https://github.com/apache/arrow-datafusion/pull/8905
   
   @zeodtr / @mslapek  I wonder if you have an opinion about this. Shall we keep it open? Shall we close it in favor of tickets that describe specific improvements?  Do you have a chance to test with more recent versions of DataFusion?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Optimizer is slow: Avoid too many string cloning in the optimizer [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1922237389

   > as I explained in https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1815885644. I
   
   I forgot out great that description was. FYI @matthewmturner  for your inspiration


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1433582559

   Here is one possible task that would help: https://github.com/apache/arrow-datafusion/issues/5309


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5157: Optimizer: Avoid too many string cloning in the optimizer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1423033205

   Thank you @zeodtr  -- I intend to port this into the datafusion benchmark suite eventually. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ygf11 commented on issue #5157: Optimizer: Avoid too many string cloning in the optimizer

Posted by "ygf11 (via GitHub)" <gi...@apache.org>.
ygf11 commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1414654588

   > And a large part of that is how it ignores errors with .ok() where were quite expensive to produce
   
   
   The result type of `field_with_name` is `Result<&DFField>`, it is not clear enough I think.
   
   For the method `field_with_name`, there are three result for the given name:
   1. There is no such field, return FieldNotFound error.
   2. Find only one field, return this field.
   3. Find one more field, return Ambiguous reference error.
   
   The first and third both will return error, in most times we should distingwish them, but it is not easy.
   
   Maybe we can change to `Result<Option<&Field>>`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1474890961

   Filed https://github.com/apache/arrow-datafusion/issues/5637 to track making the optimizer faster in general


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] zeodtr commented on issue #5157: Optimizer: Avoid too many string cloning in the optimizer

Posted by "zeodtr (via GitHub)" <gi...@apache.org>.
zeodtr commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1414719714

   @alamb 
   I've simplified the benchmark program as follows. (I would like to upload the file as an attachment, but it's impossible in my current environment) It is a single file, named `main.rs`. A lot of it is based on `datafusion/sql/examples/sql.rs` except for the optimizer part. The optimizer part is based on https://crates.io/crates/datafusion-optimizer.
   
   ```rust
   use std::{collections::HashMap, sync::Arc};
   
   use datafusion::{
       arrow::datatypes::{DataType, Field, Schema},
       common::Result,
       config::ConfigOptions,
       error::DataFusionError,
       logical_expr::{
           logical_plan::builder::LogicalTableSource, AggregateUDF, LogicalPlan, ScalarUDF,
           TableSource,
       },
       optimizer::{optimizer::Optimizer, OptimizerContext, OptimizerRule},
       sql::{
           planner::{ContextProvider, SqlToRel},
           sqlparser::{dialect::GenericDialect, parser::Parser},
           TableReference,
       },
   };
   
   #[global_allocator]
   static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
   
   fn main() {
       let sql = "select column1 from table1";
       let schema_provider = TestSchemaProvider::new();
   
       let now = std::time::Instant::now();
   
       let dialect = GenericDialect {};
       let ast = Parser::parse_sql(&dialect, sql).unwrap();
       let statement = &ast[0];
       let sql_to_rel = SqlToRel::new(&schema_provider);
       let plan = sql_to_rel.sql_statement_to_plan(statement.clone()).unwrap();
   
       println!(
           "elapsed time after creating a logical plan: {}",
           now.elapsed().as_millis()
       );
   
       let config = OptimizerContext::default();
       let optimizer = Optimizer::new();
       let optimized_plan = optimizer.optimize(&plan, &config, observe).unwrap();
   
       println!(
           "elapsed time after optimization: {}\n",
           now.elapsed().as_millis()
       );
   
       println!("plan:\n{:?}\n", plan);
       println!("optimized plan:\n{:?}", optimized_plan);
   }
   
   fn observe(_plan: &LogicalPlan, _rule: &dyn OptimizerRule) {}
   
   struct TestSchemaProvider {
       options: ConfigOptions,
       tables: HashMap<String, Arc<dyn TableSource>>,
   }
   
   impl TestSchemaProvider {
       pub fn new() -> Self {
           let mut tables = HashMap::new();
           tables.insert(
               "table1".to_string(),
               create_table_source({
                   let mut fields = Vec::new();
                   for num in 0..700 {
                       fields.push(Field::new(
                           format!("column{}", num + 1),
                           DataType::Int32,
                           false,
                       ))
                   }
                   fields
               }),
           );
   
           Self {
               options: Default::default(),
               tables,
           }
       }
   }
   
   fn create_table_source(fields: Vec<Field>) -> Arc<dyn TableSource> {
       Arc::new(LogicalTableSource::new(Arc::new(
           Schema::new_with_metadata(fields, HashMap::new()),
       )))
   }
   
   impl ContextProvider for TestSchemaProvider {
       fn get_table_provider(&self, name: TableReference) -> Result<Arc<dyn TableSource>> {
           match self.tables.get(name.table()) {
               Some(table) => Ok(table.clone()),
               _ => Err(DataFusionError::Plan(format!(
                   "Table not found: {}",
                   name.table()
               ))),
           }
       }
   
       fn get_function_meta(&self, _name: &str) -> Option<Arc<ScalarUDF>> {
           None
       }
   
       fn get_aggregate_meta(&self, _name: &str) -> Option<Arc<AggregateUDF>> {
           None
       }
   
       fn get_variable_type(&self, _variable_names: &[String]) -> Option<DataType> {
           None
       }
   
       fn options(&self) -> &ConfigOptions {
           &self.options
       }
   }
   ```
   
   And the content of Carfo.toml is as follows:
   
   ```toml
   [package]
   name = "simple_optimizer_test"
   version = "0.1.0"
   edition = "2021"
   
   # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
   
   [dependencies]
   datafusion = "17.0.0"
   mimalloc = "0.1.34"
   
   [profile.release]
   debug = 2
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5157: Optimizer: Avoid too many string cloning in the optimizer

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1414340013

   
   @zeodtr Would you be willing to contribute your benchmark program -- we have https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/benches/sql_planner.rs but maybe we could have a more end to end plan creation test 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Optimizer is slow: Avoid too many string cloning in the optimizer [arrow-datafusion]

Posted by "zeodtr (via GitHub)" <gi...@apache.org>.
zeodtr commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1920195231

   @alarmb Since there is an epic issue(#5637), this issue can be closed in favor of more specific issues.
   Thanks for your great work.
   By the way, at the moment, I'm satisfied with the performance of logical planning and optimization routines after applying some optimizations here and there, as I explained in https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1815885644. It would be nice if these optimizations could be incorporated into the official source code, perhaps with some adjustments if needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org