You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/18 19:05:40 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

jorgecarleitao opened a new pull request #7993:
URL: https://github.com/apache/arrow/pull/7993


   FYI @andygrove and @alamb 
   
   I admit I find this API a bit counter-intuitive: coming from spark, I would be expect a string when I call `df.explain()?`. However, I am following the commitment of understanding `explain` as a table with one row and one column and leave the collect and print for the users to handle.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#issuecomment-675665151


   https://issues.apache.org/jira/browse/ARROW-9760


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#discussion_r472425993



##########
File path: rust/datafusion/src/dataframe.rs
##########
@@ -174,4 +174,18 @@ pub trait DataFrame {
 
     /// Return the logical plan represented by this DataFrame.
     fn to_logical_plan(&self) -> LogicalPlan;
+
+    /// Return a DataFrame with the explanation of its plan so far.
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # fn main() -> Result<()> {
+    /// let mut ctx = ExecutionContext::new();
+    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+    /// let batches = df.limit(100)?.explain(false)?.collect()?;
+    /// # Ok(())
+    /// # }
+    /// ```
+    fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;

Review comment:
       I agree with your concern @jorgecarleitao and would expect `explain()` to just print the plan rather than return a DataFrame. That would be easy to implement (just call pretty_print there) but hard to test.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#issuecomment-676295725


   That makes sense to me
   
   On Tue, Aug 18, 2020 at 11:24 PM Jorge Leitao <no...@github.com>
   wrote:
   
   > *@jorgecarleitao* commented on this pull request.
   > ------------------------------
   >
   > In rust/datafusion/src/dataframe.rs
   > <https://github.com/apache/arrow/pull/7993#discussion_r472638005>:
   >
   > > @@ -174,4 +174,18 @@ pub trait DataFrame {
   >
   >      /// Return the logical plan represented by this DataFrame.
   >      fn to_logical_plan(&self) -> LogicalPlan;
   > +
   > +    /// Return a DataFrame with the explanation of its plan so far.
   > +    ///
   > +    /// ```
   > +    /// # use datafusion::prelude::*;
   > +    /// # use datafusion::error::Result;
   > +    /// # fn main() -> Result<()> {
   > +    /// let mut ctx = ExecutionContext::new();
   > +    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
   > +    /// let batches = df.limit(100)?.explain(false)?.collect()?;
   > +    /// # Ok(())
   > +    /// # }
   > +    /// ```
   > +    fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;
   >
   > I find it poor design that .explain prints directly to the stdout in
   > spark. IMO saving 1 extra line (print) of code is not a sufficiently good
   > reason to outright spam stdout and limit so much what a user can do with
   > .explain.
   >
   > Some downstream consequences of this decision in spark:
   >
   >    - it makes it much more difficult to log it correctly
   >    - the popular pyspark can't use it to convert it to a Python string
   >    and prettify it when it is being used in notebooks
   >
   > I agree with fn explain(&self, verbose: bool) -> String (prob.
   > Result<String>). For a user, the difference is
   >
   > df.explain()
   >
   > vs
   >
   > println("{}", df.explain()?)
   >
   > I find the latter more expressive of the user's intention, and gives them
   > the freedom to pipe the result to whatever stream they want.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow/pull/7993#discussion_r472638005>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AADXZMJFG5NSF3L4FDEOIT3SBNAWNANCNFSM4QD6L7YA>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#discussion_r472425993



##########
File path: rust/datafusion/src/dataframe.rs
##########
@@ -174,4 +174,18 @@ pub trait DataFrame {
 
     /// Return the logical plan represented by this DataFrame.
     fn to_logical_plan(&self) -> LogicalPlan;
+
+    /// Return a DataFrame with the explanation of its plan so far.
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # fn main() -> Result<()> {
+    /// let mut ctx = ExecutionContext::new();
+    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+    /// let batches = df.limit(100)?.explain(false)?.collect()?;
+    /// # Ok(())
+    /// # }
+    /// ```
+    fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;

Review comment:
       I agree with your concern @jorgecarleitao and would expect `explain()` to just print the plan rather than return a DataFrame. That would be easy to implement (just call pretty_print here) but hard to test.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#discussion_r473075611



##########
File path: rust/datafusion/src/dataframe.rs
##########
@@ -174,4 +174,18 @@ pub trait DataFrame {
 
     /// Return the logical plan represented by this DataFrame.
     fn to_logical_plan(&self) -> LogicalPlan;
+
+    /// Return a DataFrame with the explanation of its plan so far.
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # fn main() -> Result<()> {
+    /// let mut ctx = ExecutionContext::new();
+    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+    /// let batches = df.limit(100)?.explain(false)?.collect()?;
+    /// # Ok(())
+    /// # }
+    /// ```
+    fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;

Review comment:
       I agree completely. Spark's design is flawed in this case.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#issuecomment-676593892


   Ready to go, then!
   
   ![](https://image.shutterstock.com/image-photo/llama-lama-glama-funny-face-260nw-518893873.jpg)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#discussion_r472583526



##########
File path: rust/datafusion/src/dataframe.rs
##########
@@ -174,4 +174,18 @@ pub trait DataFrame {
 
     /// Return the logical plan represented by this DataFrame.
     fn to_logical_plan(&self) -> LogicalPlan;
+
+    /// Return a DataFrame with the explanation of its plan so far.
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # fn main() -> Result<()> {
+    /// let mut ctx = ExecutionContext::new();
+    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+    /// let batches = df.limit(100)?.explain(false)?.collect()?;
+    /// # Ok(())
+    /// # }
+    /// ```
+    fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;

Review comment:
       So Spark has a couple different things to help people debug. `explain` literally prints the query plan(s) to stdout. It is also possible to access the underlying logical and physical query plans using `df.queryExecution.plan` or `df.queryExecution.executedPlan` for example. We already have `to_logical_plan` and I do think that we eventually should expose the physical plan as well, but not until we have made a decision on how we're implementing the physical plan enum and optimizer (there are JIRAs open for these issues already).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#discussion_r472638005



##########
File path: rust/datafusion/src/dataframe.rs
##########
@@ -174,4 +174,18 @@ pub trait DataFrame {
 
     /// Return the logical plan represented by this DataFrame.
     fn to_logical_plan(&self) -> LogicalPlan;
+
+    /// Return a DataFrame with the explanation of its plan so far.
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # fn main() -> Result<()> {
+    /// let mut ctx = ExecutionContext::new();
+    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+    /// let batches = df.limit(100)?.explain(false)?.collect()?;
+    /// # Ok(())
+    /// # }
+    /// ```
+    fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;

Review comment:
       I find it poor design that `.explain` prints directly to the stdout in spark. IMO saving 1 extra line (print) of code is not a sufficiently good reason to outright spam stdout and limit so much what a user can do with `.explain`.
   
   Some downstream consequences of this decision in spark:
   * it makes it much more difficult to log it correctly
   * the popular pyspark can't use it to convert it to a Python string and prettify it when it is being used in notebooks
   
   I agree with `fn explain(&self, verbose: bool) -> String` (prob. `Result<String>`). For a user, the difference is
   
   ```
   df.explain()?
   ```
   
   vs 
   
   ```
   println("{}", df.explain()?)
   ```
   
   I find the latter more expressive of the user's intention, and gives them the freedom to pipe the result to whatever stream they want.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#discussion_r472482939



##########
File path: rust/datafusion/src/dataframe.rs
##########
@@ -174,4 +174,18 @@ pub trait DataFrame {
 
     /// Return the logical plan represented by this DataFrame.
     fn to_logical_plan(&self) -> LogicalPlan;
+
+    /// Return a DataFrame with the explanation of its plan so far.
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # fn main() -> Result<()> {
+    /// let mut ctx = ExecutionContext::new();
+    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+    /// let batches = df.limit(100)?.explain(false)?.collect()?;
+    /// # Ok(())
+    /// # }
+    /// ```
+    fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;

Review comment:
       Or we could add something like  `fn explain(&self, verbose: bool) -> String` if you wanted to keep it like spark. It would seem that to get a String for the explain plan, we would need to run the various optimizers and physical planner, but that is probably ok. 
   
   I am not as familiar with Spark's DataFrame API so I do not have a good feel for how people would want this API to behave.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#discussion_r472638005



##########
File path: rust/datafusion/src/dataframe.rs
##########
@@ -174,4 +174,18 @@ pub trait DataFrame {
 
     /// Return the logical plan represented by this DataFrame.
     fn to_logical_plan(&self) -> LogicalPlan;
+
+    /// Return a DataFrame with the explanation of its plan so far.
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # fn main() -> Result<()> {
+    /// let mut ctx = ExecutionContext::new();
+    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+    /// let batches = df.limit(100)?.explain(false)?.collect()?;
+    /// # Ok(())
+    /// # }
+    /// ```
+    fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;

Review comment:
       I find it poor design that `.explain` prints directly to the stdout in spark. IMO saving 1 extra line (print) of code is not a sufficiently good reason to outright spam stdout and limit so much what a user can do with `.explain`.
   
   Some downstream consequences of this decision in spark:
   * it makes it much more difficult to log it correctly
   * the popular pyspark can't use it to convert it to a Python string and prettify it when it is being used in notebooks
   
   I agree with `fn explain(&self, verbose: bool) -> String` (prob. `Result<String>`). For a user, the difference is
   
   ```
   df.explain()
   ```
   
   vs 
   
   ```
   println("{}", df.explain()?)
   ```
   
   I find the latter more expressive of the user's intention, and gives them the freedom to pipe the result to whatever stream they want.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #7993:
URL: https://github.com/apache/arrow/pull/7993


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #7993: ARROW-9760: [Rust] [DataFusion] Added DataFrame::explain

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #7993:
URL: https://github.com/apache/arrow/pull/7993#issuecomment-676512584


   After thinking about this some more, I'm also fine with the current implementation that returns a DataFrame. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org