
Posted to github@arrow.apache.org by "zeodtr (via GitHub)" <gi...@apache.org> on 2023/02/02 09:35:29 UTC

[GitHub] [arrow-datafusion] zeodtr opened a new issue, #5157: Optimizer: Avoid too many string cloning in the optimizer

zeodtr opened a new issue, #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

I'm not sure this qualifies as a feature request, but it isn't a bug either (although it is a performance problem), so I'm filing it as a feature request.

I'm benchmarking the optimizers of DataFusion and Calcite.
I intended to compare the quality of their optimized plans, assuming that DataFusion's optimizer would be faster (since it's written in Rust).
But to my surprise, I found that Calcite's optimizer is much faster (~20x) in some cases.

The case is as follows:

* The query statement is very simple: "select column_1 from table_1"
* table_1 has about 700 columns (yes, it really has that many columns, and that's the problem).

While Calcite finished the optimization in about 7 msec, DataFusion's optimizer took about 120 msec.
At first the number was worse, but it settled at about 120 msec when I set the global allocator to mimalloc. (I also tried snmalloc, which was somewhat faster at about 100 msec, but it did not play well with valgrind, so I chose mimalloc, at least temporarily.)

I ran the test program with valgrind / callgrind and drew the call graph. The graph showed that about half of the execution time is being spent on `<alloc::string::String as core::clone::Clone>::clone`. The call count was 3,930,814.

I also ran the optimizer on another table with fewer columns (about 200), and it took much less time: about 12 msec.

So, I suspect that the optimizer becomes slow (at least for a table with many columns) because it clones the strings related to the schema of the table too many times.

**Describe the solution you'd like**

Removing unnecessary clones may help. Alternatively, make the fields immutable and manage them with reference-counted smart pointers.
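
To illustrate the reference-counting idea, here is a minimal sketch. `SharedField` and `build_shared_schema` are hypothetical names made up for this example and are not DataFusion's actual types. It assumes field names are immutable once the schema is built, so they can be stored as `Arc<str>`. Cloning the schema then only bumps a reference count per field instead of allocating and copying each name.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a schema field (not DataFusion's real type).
// The name is immutable and shared, so `clone()` copies a pointer and
// increments a reference count instead of allocating a new String.
#[derive(Clone, Debug)]
struct SharedField {
    name: Arc<str>,
}

// Builds an illustrative schema with `n` columns named column_0, column_1, ...
fn build_shared_schema(n: usize) -> Vec<SharedField> {
    (0..n)
        .map(|i| SharedField {
            name: Arc::from(format!("column_{}", i)),
        })
        .collect()
}

// Returns true when both fields point at the same underlying name buffer,
// i.e. the clone did not copy the string data.
fn shares_name(a: &SharedField, b: &SharedField) -> bool {
    Arc::ptr_eq(&a.name, &b.name)
}
```

With this layout, `build_shared_schema(700).clone()` performs 700 reference-count increments rather than 700 string allocations. Whether that removes the bottleneck reported above would need to be measured.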

**Describe alternatives you've considered**

No alternatives.

**Additional context**

None.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mslapek commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "mslapek (via GitHub)" <gi...@apache.org>.

mslapek commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1475276695

   **TBH** String interning would be the best ⭐️ - no cloning, quick `Eq`/`Hash`...
   
   Instead of `String` we could use `StringId`:
   
   ```rust
   #[derive(Hash, PartialEq, Eq, Copy, Clone)]
   struct StringId(usize);
   ```
   
   And some `HashMap<String, StringId>` to assign strings to unique IDs.
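
A minimal sketch of how such an interner could look follows. The `Interner` type and its `intern` and `resolve` methods are hypothetical names for illustration, not an existing DataFusion or Arrow API. It keeps the suggested `HashMap<String, StringId>` for lookups and adds a `Vec<String>` so that an ID can be turned back into its text.

```rust
use std::collections::HashMap;

#[derive(Hash, PartialEq, Eq, Copy, Clone, Debug)]
struct StringId(usize);

// Hypothetical interner: maps each distinct string to a small Copy ID.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, StringId>, // string -> ID
    names: Vec<String>,             // ID (as index) -> string
}

impl Interner {
    // Returns the existing ID for `s`, or assigns the next free one.
    fn intern(&mut self, s: &str) -> StringId {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = StringId(self.names.len());
        self.ids.insert(s.to_owned(), id);
        self.names.push(s.to_owned());
        id
    }

    // Looks up the text for an ID; None if the ID was never issued.
    fn resolve(&self, id: StringId) -> Option<&str> {
        self.names.get(id.0).map(String::as_str)
    }
}
```

Once names are interned, comparing or hashing two column names is an integer operation, and copying a `StringId` allocates nothing. The trade-off is that every consumer needs access to the interner to recover the text.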



[GitHub] [arrow-datafusion] tustvold commented on issue #5157: Optimizer is slow: Avoid too many string cloning in the optimizer

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #5157:
URL: https://github.com/apache/arrow-datafusion/issues/5157#issuecomment-1484168398

   https://github.com/apache/arrow-rs/issues/3955 may also be relevant here. I'm optimistic that we can drastically reduce the overheads without needing to reach for exotic solutions like string interning.

