You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/02/16 19:07:34 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

alamb opened a new issue, #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Related to https://github.com/apache/arrow-datafusion/issues/5157
   
   There are many places in the code that use fallible functions on `DFSchema` to check if a column exists:
   https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
   https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
   https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column
   
   For example, there is code that looks like this (call `is_ok()` or `is_err()`and totally discards the error with the string)
   ```rust
   input_schema.field_from_column(col).is_ok()
   ```
   
   This is problematic because they return a DataFusionError that not only has an allocated `String` but also often has a nice error message. You can see them appearing in the trace on https://github.com/apache/arrow-datafusion/issues/5157
   
   As part of making the optimizer faster Related to https://github.com/apache/arrow-datafusion/issues/5157 we need to avoid these string allocations,
   
   Thus I propose:
   
   1. Add new functions for checking that return a bool rather than an error
   2. Replace the use of `is_err()` with 
   
   Find the field with the given qualified column
   
   For example, 
   ```rust
   impl DFSchema {
     // existing function that returns Result
     pub fn field_from_column(&self, column: &Column) -> Result<&DFField> {...}
   
     // new function that returns bool  <---- Add this new function
     pub fn has_column(&self, column: &Column) -> bool {...}
   }
   ```
   
   And then replace in the code that have the pattern
   
   ```rust
   input_schema.field_from_column(col).is_ok()
   ```
   
   With 
   ```rust
   input_schema.has_column(col)
   ```
   
   
   
   **Describe the solution you'd like**
   Ideally someone would do this transition one function on DFSchema at a time (not one giant PR please 🙏 )
   
   **Describe alternatives you've considered**
   There are more involved proposals for larger changes to DFSchema but simply avoiding this check might help a lot
   
   **Additional context**
   I think this is a good first exercise as the desire is well spelled out and it is a software engineering exercise rather than requires deep datafusion expertise


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "matthewmturner (via GitHub)" <gi...@apache.org>.

matthewmturner commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1872538847

   @alamb Sounds good will do that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1889885280

   > * I profiled the benchmark for a simple query on a wide table (700 columns) and a significant amount of the cpu time is  (~87%) is now coming from `has_column_with_qualified_name` (first screenshot below). 87% in the case of creating physical plan and 66% of creating unoptimized logical plan (second screenshot).
   > 
   > Given this seems to be hotspot for wide tables do you think best next step would be looking into improving lookup time by adding a btree (or whatever) or should we improve the foundation and work on updating the schema first? from what ive seen updating the schema may make adding the index easier so that may be a good start.
   
   Yes I agree getting DFSchema into better shape (e.g. not actually copying so many things) would likely make this task easier
   
   It also looks like `has_column_with_qualified_name` is always being called from `DFSchema::merge` I wonder if we can figure out why that needs to be called so much. My bet is that most of the callsites dont' actually add any new fields. Maybe we can quickly check if the pass didn't many any changes to the children, then there is no need to call DFSchema::merge
   
   Or maybe we can find some way to quickly compare if two schemas are the same 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] suxiaogang223 commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1434251996

   i'm happy to do this🌝


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "matthewmturner (via GitHub)" <gi...@apache.org>.

matthewmturner commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1872551464

   @alamb aha i had plans to profile that *exact* thing as a starting point.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] suxiaogang223 commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1435152254

   I think returning enum will be same as returning result, because the caller also have to handle Ambiguous and return an `Err`.
   
   Returning `Result<bool>` can also avoid str allocating in `field_from_name().is_err()`. The code will be like this:
   ```rust
   If schema.had_column(col)? {...}
   ```
   
   Maybe the key question is that do we need      to check 'Ambiguous error' each time the had_column called? Actually we can just check Ambiguous err once at begin.
   
   I'm not sure if my thinking is correct, need your advice @alamb 🤓


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1872941331

   > simply replace them with has_column_with_unqualified_name / has_column_with_qualified_name which return booleans.
   
   The other thing to do would be to look into making DFSchema cheaper to copy/create, for example using an Arc instead of `OwnedTableReference` (much as @tustvold  did for  `Field` in arrow-rs's `Fields`) so that copying a `DFField` doesn't require copying around strings
   
   https://github.com/apache/arrow-datafusion/blob/848f6c395afef790880112f809b1443949d4bb0b/datafusion/common/src/dfschema.rs#L810


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] ygf11 commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "ygf11 (via GitHub)" <gi...@apache.org>.

ygf11 commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1434373565

   Seems it is relative to the pr #5287.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] suxiaogang223 commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1435067876

   > The key for performance is not to return a `DataFusionError` with a allocated string
   
   Maybe we can use `assert` to assume that the "Ambiguous reference error" should not happen in `had_column`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1872541162

   https://github.com/apache/arrow-datafusion/pull/8665#issuecomment-1872540911 might be instructive


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "matthewmturner (via GitHub)" <gi...@apache.org>.

matthewmturner commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1878900218

   Just from updating the merge function we already see considerable improvements
   
   <img width="1025" alt="image" src="https://github.com/apache/arrow-datafusion/assets/22136083/4da671ef-48d5-410b-98e1-6309915572b5">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1435059041

   The key for performance is not to return a `DataFusionError` with a allocated string


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "matthewmturner (via GitHub)" <gi...@apache.org>.

matthewmturner commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1889561301

   @alamb 
   I've been looking into this more for places where we can replace unused results with booleans but nothing stuck out for that (let me know if you know or your intuition say otherwise).  I've also been using the great analysis from @zeodotr in https://github.com/apache/arrow-datafusion/issues/7698#issuecomment-1815885644 to guide some of my review.
   
   A couple things:
   
   - I looked at optimization 6 from @zeodotr's list and I wasnt able to find `columnize_expr` as a hot spot in the context of creating physical plan (I tried reproducing on a wide table with several aggregates) which i believe is the use case they had (i didnt create 3000+ aggregates though like they have).  it shows up as ~3% of cpu of creating unoptimized logical plan.
   - I profiled the benchmark for a simple query on a wide table (700 columns) and a significant amount of the cpu time is  (~87%) is now coming from `has_column_with_qualified_name` (first screenshot below). 87% in the case of creating physical plan and 66% of creating unoptimized logical plan (second screenshot).
   
   Given this seems to be hotspot for wide tables do you think best next step would be looking into improving lookup time by adding a btree or should we improve the foundation and work on updating the schema first?
   
   <img width="1728" alt="image" src="https://github.com/apache/arrow-datafusion/assets/22136083/d5892e02-68f4-4276-88c8-31587acdf4ee">
   
   <img width="1728" alt="image" src="https://github.com/apache/arrow-datafusion/assets/22136083/cb3ea6e6-061c-4cf0-a753-661a88b37988">


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] suxiaogang223 commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1434939180

   > Seems it is relative to the pr #5287.
   
   Yes, I think the method had_column also should distinguish FieldNotFound and Ambiguous reference error.
   Maybe the new method should be
   ```rust
   pub fn has_column(&self, column: &Column) -> Result<bool> {...}
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] suxiaogang223 commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "suxiaogang223 (via GitHub)" <gi...@apache.org>.

suxiaogang223 commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1434934544

   > #5287
   Yes, I think the method `had_column` also should distinguish FieldNotFound and Ambiguous reference error.
   Maybe the new method should be
   ```rust
   pub fn has_column(&self, column: &Column) -> Result<bool> {...}
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1435083129

   >  Maybe we can use assert to assume that the "Ambiguous reference error" should not happen in had_column?
   
   How about returning an enum like
   
   ```rust
   
   enum FoundColumn {
     Found,
     NotFound,
     Ambiguous
   }
   
   
   pub fn has_column(&self, column: &Column) -> FoundColumn {...}
   
   ``` 
   
   ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1435671974

   Not sure -- will try and check out https://github.com/apache/arrow-datafusion/pull/5328 shortly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "matthewmturner (via GitHub)" <gi...@apache.org>.

matthewmturner commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1871747178

   @alamb i can pick this up


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1872535091

   > @alamb i can pick this up
   
   Thank you @matthewmturner  -- I think this is a "tip of the iceberg" type bug where there are many places in the optimizzer that use DFSchema that could be made faster.
   
   Thus I suggest if possible, taking some time to map out a plan to incrementally improve the situation over time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "matthewmturner (via GitHub)" <gi...@apache.org>.

matthewmturner commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1872623993

I tried reproducing your results with Instruments but wasnt able to get to the granularity that you had that showed DFSchema as being heavy.

However, I put together a flamegraph and came to similar conclusion. In the below image the blocks in purple are for my search of `DFSchema`. Of those, there was a lot of `merge` and `field_with_qualified_name` (which is often called by `merge`) - this appears to be consistent with your profiling. It also looks like all uses of DFSchema are during the optimization pass which is consistent with your observation.

Based on this, and how `field_with_name` / `field_with_qualified_name` are used within merge I think I may be able to simply replace them with `has_column_with_unqualified_name` / `has_column_with_qualified_name` which return booleans.

Im hoping, time permitting, to also do some memory / allocations profiling to make sure these types of change have the desired effect.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "matthewmturner (via GitHub)" <gi...@apache.org>.

matthewmturner commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1878806604

   @alamb sorry for delay here, I went down a rabbit hole of trying to get some good memory / allocation benchmarks as a i really wanted to be able measure / compare cause (allocations) instead of symptom (time).  made good progress but dont want to hold this up any longer and can continue that work separately.
   
   The low hanging fruit and what this issue was created for seems to be updating those function calls so I think I will start with that and separately we can look into updating how schemas are handled - if thats okay with you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Make a faster way to check column existence in optimizer (not `is_err()`) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309#issuecomment-1879228505

   > @alamb sorry for delay here, I went down a rabbit hole of trying to get some good memory / allocation benchmarks as a i really wanted to be able measure / compare cause (allocations) instead of symptom (time). made good progress but dont want to hold this up any longer and can continue that work separately.
   > 
   > The low hanging fruit and what this issue was created for seems to be updating those function calls so I think I will start with that and separately we can look into updating how schemas are handled - if thats okay with you.
   
   Sounds like a great plan -- thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org