You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/01 19:13:30 UTC

[GitHub] [arrow-datafusion] AssHero opened a new pull request, #2823: Eliminate multi limit-offset nodes to emptyRelation

AssHero opened a new pull request, #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823

   # Which issue does this PR close?
   Closes #2822 
   
    # Rationale for this change
   For logical plans, if upper offset is equal or greater than current's limit, replaces with an [LogicalPlan::EmptyRelation].
   
   For this query: select * from (select * from t1  limit 2 offset 2 ) a  order by a.column1 offset 2, we get 
   the Logical Plan:
   ----------------Limit: skip=2, fetch=None                                 
   -----------------Sort: #a.column1 ASC NULLS LAST                         
   -------------------Projection: #a.column1                                
   ---------------------EmptyRelation                                       
   
   # What changes are included in this PR?
   add rules to eliminate multi limit-offset nodes to emptyRelation in datafusion/optimizer/src/eliminate_limit.rs
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#issuecomment-1173337466

   Improvements according to the comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r912578893


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    upper_offset: usize,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip: offset,
+            fetch: limit,
+            input,
+            ..
+        }) => {
+            // If upper's offset is equal or greater than current's limit,
+            // replaces with an [LogicalPlan::EmptyRelation].
+            // For such query: select * from (select * from xxx limit 5) a limit 2 offset 5;
+            match limit {
+                Some(limit) => {
+                    if *limit == 0 || upper_offset >= *limit {
+                        return Ok(LogicalPlan::EmptyRelation(EmptyRelation {
+                            produce_one_row: false,
+                            schema: input.schema().clone(),
+                        }));
+                    }
+                }
+                None => {}
+            }
+
+            let offset = match offset {
+                Some(offset) => *offset,
+                None => 0,
+            };
+
+            let expr = plan.expressions();
+
+            // apply the optimization to all inputs of the plan
+            let inputs = plan.inputs();
+            let new_inputs = inputs
+                .iter()
+                .map(|plan| eliminate_limit(_optimizer, offset, plan, _optimizer_config))
+                .collect::<Result<Vec<_>>>()?;
+
+            from_plan(plan, &expr, &new_inputs)
+        }
+        // Rest: recurse and find possible LIMIT 0/Multi LIMIT OFFSET nodes
+        _ => {
+            // For those plans(projection/sort/..) which do not affect the output rows of sub-plans, we still use upper_offset;
+            // otherwise, use 0 instead.
+            let offset = match plan {

Review Comment:
   For join plan, the upper_offset is set to 0, and the limit node for the join is not affected. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913725860


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -128,4 +195,44 @@ mod tests {
             \n    TableScan: test";
         assert_optimized_plan_eq(&plan, expected);
     }
+

Review Comment:
   yes, I'll add more test cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ming535 commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
ming535 commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913716649


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {

Review Comment:
   `SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5;`
   In the above query, the current plan is the one with only `fetch` (the subquery).
   
   I thought when the current node is `Limit` with only `skip`, it should be handled the same as `Projection`, `Sort`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r914054928


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip, fetch, input, ..
+        }) => {
+            let ancestor_skip = match ancestor {
+                Ancestor::FromLimit { skip, .. } => skip.unwrap_or(0),
+                _ => 0,
+            };
+            // If ancestor's skip is equal or greater than current's fetch,
+            // replaces with an [LogicalPlan::EmptyRelation].
+            // For such query, the inner query(select * from xxx limit 5) should be optimized as an EmptyRelation:
+            // select * from (select * from xxx limit 5) a limit 2 offset 5;
+            match fetch {
+                Some(fetch) => {
+                    if *fetch == 0 || ancestor_skip >= *fetch {
+                        return Ok(LogicalPlan::EmptyRelation(EmptyRelation {
+                            produce_one_row: false,
+                            schema: input.schema().clone(),
+                        }));
+                    }
+                }
+                None => {}
+            }
+
+            let expr = plan.expressions();
+
+            // apply the optimization to all inputs of the plan
+            let inputs = plan.inputs();
+            let new_inputs = inputs
+                .iter()
+                .map(|plan| {
+                    eliminate_limit(
+                        _optimizer,
+                        &Ancestor::FromLimit { skip: *skip },
+                        plan,
+                        _optimizer_config,
+                    )
+                })
+                .collect::<Result<Vec<_>>>()?;
+
+            from_plan(plan, &expr, &new_inputs)
+        }
+        // Rest: recurse and find possible LIMIT 0/Multi LIMIT OFFSET nodes
+        _ => {
+            // For those plans(projection/sort/..) which do not affect the output rows of sub-plans, we still use ancestor;
+            // otherwise, use NotRelevant instead.
+            let ancestor = match plan {
+                LogicalPlan::Projection { .. } | LogicalPlan::Sort { .. } => ancestor,
+                _ => &Ancestor::NotRelevant,

Review Comment:
   I wonder if `Ancester::Unknown` would make the intent clearer compared to `Ancester::NotRelevant`?



##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -128,4 +200,114 @@ mod tests {
             \n    TableScan: test";
         assert_optimized_plan_eq(&plan, expected);
     }
+
+    #[test]
+    fn limit_fetch_with_ancestor_limit_skip() {
+        let table_scan = test_table_scan().unwrap();
+        let plan = LogicalPlanBuilder::from(table_scan)
+            .aggregate(vec![col("a")], vec![sum(col("b"))])
+            .unwrap()
+            .limit(None, Some(2))
+            .unwrap()
+            .limit(Some(2), None)
+            .unwrap()
+            .build()
+            .unwrap();
+
+        // No aggregate / scan / limit
+        let expected = "Limit: skip=2, fetch=None\
+            \n  EmptyRelation";
+        assert_optimized_plan_eq(&plan, expected);
+    }
+
+    #[test]
+    fn multi_limit_offset_sort_eliminate() {
+        let table_scan = test_table_scan().unwrap();
+        let plan = LogicalPlanBuilder::from(table_scan)
+            .aggregate(vec![col("a")], vec![sum(col("b"))])
+            .unwrap()
+            .limit(None, Some(2))
+            .unwrap()
+            .sort(vec![col("a")])
+            .unwrap()
+            .limit(Some(2), Some(1))
+            .unwrap()
+            .build()
+            .unwrap();
+
+        let expected = "Limit: skip=2, fetch=1\
+            \n  Sort: #test.a\
+            \n    EmptyRelation";
+        assert_optimized_plan_eq(&plan, expected);
+    }
+
+    #[test]
+    fn limit_fetch_with_ancestor_limit_fetch() {

Review Comment:
   👍 



##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip, fetch, input, ..
+        }) => {
+            let ancestor_skip = match ancestor {
+                Ancestor::FromLimit { skip, .. } => skip.unwrap_or(0),
+                _ => 0,
+            };
+            // If ancestor's skip is equal or greater than current's fetch,
+            // replaces with an [LogicalPlan::EmptyRelation].
+            // For such query, the inner query(select * from xxx limit 5) should be optimized as an EmptyRelation:
+            // select * from (select * from xxx limit 5) a limit 2 offset 5;
+            match fetch {
+                Some(fetch) => {
+                    if *fetch == 0 || ancestor_skip >= *fetch {

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ming535 commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
ming535 commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913821663


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {

Review Comment:
   > Limit with only skip should be handled here. For query: SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5;
   > 
   > what do you think about this?
   
   I see, I misunderstood your code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913725374


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {

Review Comment:
   The ancestor LIMIT node is OFFSET 5 for query "SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5". 
   For this case, I think we should use 5 as ancestor's skip to optimizer inputs.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r912580580


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    upper_offset: usize,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip: offset,
+            fetch: limit,
+            input,

Review Comment:
   I'll fix this, Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ming535 commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
ming535 commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913473583


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {

Review Comment:
   How about
   ```
   match plan {
     LogicalPlan::Limit(Limit {skip, Some(fetch), input, ..}})...
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913668512


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {

Review Comment:
   Limit with only skip should be handled here.
   For query: SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5;
   
   what do you think about this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#issuecomment-1172880840

   cc @ming535 - would you like to review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ming535 commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
ming535 commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913718598


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -128,4 +195,44 @@ mod tests {
             \n    TableScan: test";
         assert_optimized_plan_eq(&plan, expected);
     }
+

Review Comment:
   By the way, can we add more tests here to handle all combination of `skip` and `fetch` between ancestor and child?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#issuecomment-1176189226

   Thanks again all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ming535 commented on pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
ming535 commented on PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#issuecomment-1172921561

   @AssHero Hi thanks for your work! I looked into this PR, and have some concerns on logical plan node like `Join` when this optimization is used and left some comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r914353805


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip, fetch, input, ..
+        }) => {
+            let ancestor_skip = match ancestor {
+                Ancestor::FromLimit { skip, .. } => skip.unwrap_or(0),
+                _ => 0,
+            };
+            // If ancestor's skip is equal or greater than current's fetch,
+            // replaces with an [LogicalPlan::EmptyRelation].
+            // For such query, the inner query(select * from xxx limit 5) should be optimized as an EmptyRelation:
+            // select * from (select * from xxx limit 5) a limit 2 offset 5;
+            match fetch {
+                Some(fetch) => {
+                    if *fetch == 0 || ancestor_skip >= *fetch {
+                        return Ok(LogicalPlan::EmptyRelation(EmptyRelation {
+                            produce_one_row: false,
+                            schema: input.schema().clone(),
+                        }));
+                    }
+                }
+                None => {}
+            }
+
+            let expr = plan.expressions();
+
+            // apply the optimization to all inputs of the plan
+            let inputs = plan.inputs();
+            let new_inputs = inputs
+                .iter()
+                .map(|plan| {
+                    eliminate_limit(
+                        _optimizer,
+                        &Ancestor::FromLimit { skip: *skip },
+                        plan,
+                        _optimizer_config,
+                    )
+                })
+                .collect::<Result<Vec<_>>>()?;
+
+            from_plan(plan, &expr, &new_inputs)
+        }
+        // Rest: recurse and find possible LIMIT 0/Multi LIMIT OFFSET nodes
+        _ => {
+            // For those plans(projection/sort/..) which do not affect the output rows of sub-plans, we still use ancestor;
+            // otherwise, use NotRelevant instead.
+            let ancestor = match plan {
+                LogicalPlan::Projection { .. } | LogicalPlan::Sort { .. } => ancestor,
+                _ => &Ancestor::NotRelevant,

Review Comment:
   > I wonder if `Ancester::Unknown` would make the intent clearer compared to `Ancester::NotRelevant`?
   
   This is consistent with limit_push_down.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ming535 commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
ming535 commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r912374937


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    upper_offset: usize,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip: offset,
+            fetch: limit,
+            input,
+            ..
+        }) => {
+            // If upper's offset is equal or greater than current's limit,

Review Comment:
   // If ancestor's `offset`/`skip` is equal or greater than current's `fetch`,
   // replaces the current plan with an `LogicalPlan::EmptyRelation`.
   // For example, in the following query, the subquery `test_b` should be optimized as an empty table: SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b LIMIT 2 OFFSET 5;
   



##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    upper_offset: usize,

Review Comment:
   maybe we can use `ancestor_skip` instead of `upper_offset` to make it consistent with code in `limit_push_down`



##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 

Review Comment:
   I think we should also add some comments in top this file to clarify the intention of this optimizer rule in line 18 and line 27.
   ```
   //! Optimizer rule to replace...
   ```



##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    upper_offset: usize,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip: offset,
+            fetch: limit,
+            input,
+            ..
+        }) => {
+            // If upper's offset is equal or greater than current's limit,
+            // replaces with an [LogicalPlan::EmptyRelation].
+            // For such query: select * from (select * from xxx limit 5) a limit 2 offset 5;
+            match limit {
+                Some(limit) => {
+                    if *limit == 0 || upper_offset >= *limit {
+                        return Ok(LogicalPlan::EmptyRelation(EmptyRelation {
+                            produce_one_row: false,
+                            schema: input.schema().clone(),
+                        }));
+                    }
+                }
+                None => {}
+            }
+
+            let offset = match offset {
+                Some(offset) => *offset,
+                None => 0,
+            };
+
+            let expr = plan.expressions();
+
+            // apply the optimization to all inputs of the plan
+            let inputs = plan.inputs();
+            let new_inputs = inputs
+                .iter()
+                .map(|plan| eliminate_limit(_optimizer, offset, plan, _optimizer_config))
+                .collect::<Result<Vec<_>>>()?;
+
+            from_plan(plan, &expr, &new_inputs)
+        }
+        // Rest: recurse and find possible LIMIT 0/Multi LIMIT OFFSET nodes
+        _ => {
+            // For those plans(projection/sort/..) which do not affect the output rows of sub-plans, we still use upper_offset;
+            // otherwise, use 0 instead.
+            let offset = match plan {

Review Comment:
   Consider this case: a `Join` has an input node that has `limit 10`, and the ancestor of `Join` has `offset 11`. If the optimizer is allowed to optimize in this situation, then the result might be wrong.



##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    upper_offset: usize,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip: offset,
+            fetch: limit,
+            input,

Review Comment:
   may be we should keep use `skip` and `fetch` instead of `offset` and `limit` to keep it consistent with code in `limit_push_down`



##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    upper_offset: usize,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip: offset,
+            fetch: limit,
+            input,
+            ..
+        }) => {
+            // If upper's offset is equal or greater than current's limit,
+            // replaces with an [LogicalPlan::EmptyRelation].
+            // For such query: select * from (select * from xxx limit 5) a limit 2 offset 5;
+            match limit {
+                Some(limit) => {
+                    if *limit == 0 || upper_offset >= *limit {
+                        return Ok(LogicalPlan::EmptyRelation(EmptyRelation {
+                            produce_one_row: false,
+                            schema: input.schema().clone(),
+                        }));
+                    }
+                }
+                None => {}
+            }
+
+            let offset = match offset {

Review Comment:
   you may want to use this `let skip = skip.unwrap_or(0)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r912578893


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +35,80 @@ impl EliminateLimit {
     }
 }
 
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    upper_offset: usize,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::Limit(Limit {
+            skip: offset,
+            fetch: limit,
+            input,
+            ..
+        }) => {
+            // If upper's offset is equal or greater than current's limit,
+            // replaces with an [LogicalPlan::EmptyRelation].
+            // For such query: select * from (select * from xxx limit 5) a limit 2 offset 5;
+            match limit {
+                Some(limit) => {
+                    if *limit == 0 || upper_offset >= *limit {
+                        return Ok(LogicalPlan::EmptyRelation(EmptyRelation {
+                            produce_one_row: false,
+                            schema: input.schema().clone(),
+                        }));
+                    }
+                }
+                None => {}
+            }
+
+            let offset = match offset {
+                Some(offset) => *offset,
+                None => 0,
+            };
+
+            let expr = plan.expressions();
+
+            // apply the optimization to all inputs of the plan
+            let inputs = plan.inputs();
+            let new_inputs = inputs
+                .iter()
+                .map(|plan| eliminate_limit(_optimizer, offset, plan, _optimizer_config))
+                .collect::<Result<Vec<_>>>()?;
+
+            from_plan(plan, &expr, &new_inputs)
+        }
+        // Rest: recurse and find possible LIMIT 0/Multi LIMIT OFFSET nodes
+        _ => {
+            // For those plans(projection/sort/..) which do not affect the output rows of sub-plans, we still use upper_offset;
+            // otherwise, use 0 instead.
+            let offset = match plan {

Review Comment:
   For join plan, the upper_offset is set to 0, and the limit node for the join is not affected. 
   
    let offset = match plan {
                   LogicalPlan::Projection { .. } | LogicalPlan::Sort { .. } => upper_offset,
                   _ => 0,  //  join/agg/filter and any others...
               };



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913725374


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {

Review Comment:
   The ancestor LIMIT node is OFFSET 5 for query "SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5". 
   For this case, we should use 5 as ancestor's skip.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#issuecomment-1173107028

   > @AssHero Hi thanks for your work! I looked into this PR, and have some concerns on logical plan node like `Join` when this optimization is used and left some comments.
   
   @ming535  Thanks!But I didn't see any comments here, please let me know if I miss something?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#discussion_r913725374


##########
datafusion/optimizer/src/eliminate_limit.rs:
##########
@@ -35,35 +39,98 @@ impl EliminateLimit {
     }
 }
 
+/// Ancestor indicates the current ancestor in the LogicalPlan tree
+/// when traversing down related to "eliminate limit".
+enum Ancestor {
+    /// Limit
+    FromLimit { skip: Option<usize> },
+    /// Other nodes that don't affect the adjustment of "Limit"
+    NotRelevant,
+}
+
+/// replaces LIMIT 0 with an [LogicalPlan::EmptyRelation]
+/// replaces LIMIT node whose ancestor LIMIT's skip is greater than or equal to current's fetch
+/// with an [LogicalPlan::EmptyRelation]
+fn eliminate_limit(
+    _optimizer: &EliminateLimit,
+    ancestor: &Ancestor,
+    plan: &LogicalPlan,
+    _optimizer_config: &OptimizerConfig,
+) -> Result<LogicalPlan> {
+    match plan {

Review Comment:
   The ancestor LIMIT node is OFFSET 5 for query "SELECT * FROM (SELECT * FROM test_a LIMIT 5) AS test_b OFFSET 5". 
   For this case, we should use 5 as ancestor's skip to optimizer inputs.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
AssHero commented on PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823#issuecomment-1175170548

   Add more test cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #2823: Eliminate multi limit-offset nodes to emptyRelation

Posted by GitBox <gi...@apache.org>.
alamb merged PR #2823:
URL: https://github.com/apache/arrow-datafusion/pull/2823


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org