You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/10 21:24:22 UTC

[GitHub] [spark] amaliujia opened a new pull request, #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

amaliujia opened a new pull request, #38193:
URL: https://github.com/apache/spark/pull/38193

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Refactor proto from `Read` to `UnresolvedRelation`.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   UnresolvedRelation has its own proto. In the future we will add a proto for DataSource. We can decouple these two plans.
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r991673158


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   Per the review suggestions on https://github.com/apache/spark/pull/38086, we plan to decouple unresolved relation and data source (e.g. in different messages). cc @cloud-fan 
   
   Data source needs its own plan with its own fields. In that case current `Read` will only be unresolved relation. 
   
   For Spark I don't think Relation and DataSource are ever being the same concept. Relation has an identifier and analyzer needs to resolve it. Data source has format, path, schema and other configurations. IIRC relation and data source do not share same fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r991652681


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   why?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r992254515


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   I am not sure I follow why we need to bifurcate into different types of reads in the proto. The whole point of using parsed plans is to have a high level (non-ambiguous) description of the operation, not an implementation, the latter is for Spark to figure out. We just need to describe whatever it takes what is needed to read the data. If it is a table scan then it is easy, you only need a name, otherwise you may have to add a bit more information.
   
   BTW nitpicking on naming. Operations should almost always have a verb as its name (scan, project, filter, aggregate, generate, ...) if it does not, you probably should rethink the name. `UNRESOLVED RELATION` is the thing you want to scan, and not the operation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r992889787


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   I was thinking something similar: 
   
   It is a wrong pattern that a message contains many fields while a subset of the fields that belong to one type and others that belong to different types, which is option 1 that Wenchen mentioned.
   
   It's good that we keep all relevant fields into one message. If there is more than one message but we want to co-locate it then there is a type to differentiate those, which is Wenchen's option 2 and Martin's example, if I am not mis-read it.
   
   I think we are sort of having the consensus on the rule of when to split messages. 
   
   
   However the rule of when to co-locate messages here is still not clear on, for example, why we don't choose to define
   ```
   message ExecuteWithSingleInput {
    oneof node_type {
        Project,
        Filter
    } 
    
    message Project {
    
    }
    
    message Filter {
   
    }
   }
   ```
   
   Basically the rule to split messages is clear, but the rule to co-locate message is not. We can co-locate `Project` and `Filter` as they both have single relation input. But we have chosen not to in the proto. The naming in my example is bad and I am pretty we can have a better verb to describe it.
   
   To me the sharing semantic of a `unresolved_relation` and `data_source` is even less than `project` and `filter`. In Apache Calcite, it can merge `project` and `filter` and call it a `Calc`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38193:
URL: https://github.com/apache/spark/pull/38193#issuecomment-1273833576

   R: @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38193:
URL: https://github.com/apache/spark/pull/38193#issuecomment-1275403714

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r991817123


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   I don't agree with the change. a read is a sink the same way you use it in the from statement. 
   
   What's wrong with modeling the different read types below Read? 
   
   Conceptually, the hierarchical modeling of Read -> Table and Read-> data source makes a lot of sense.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r991673158


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   Per the review suggestions on https://github.com/apache/spark/pull/38086, we plan to decouple unresolved relation and data source (e.g. in different messages). cc @cloud-fan 
   
   Data source needs its own plan with its own fields. In that case current `Read` will only be unresolved relation. 
   
   For Spark I don't think Relation and DataSource are ever being the same concept. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r991722481


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -60,16 +60,9 @@ message SQL {
   string query = 1;
 }
 
-// Relation that reads from a file / table or other data source. Does not have additional

Review Comment:
   @grundprinzip It's hard to unify this. Spark has 2 different code paths to resolve a table via name, and resolve a data source via format name, options, etc. This is kind of merging two query plans into one when they have no common fields. It's better to split.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r992464590


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   Maybe we're getting closer to it. Yes, proto has proper Union type support. So what the `Read` relation should look like is the following:
   
   ```
   message Read {
   
     oneof read_type {
         UnresolvedRelation unresolved_relation = 1;
         UnresolvedDataSource unresolved_data_source =2;
     }
   
     message UnresolvedRelation {
     }
   
     message UnresolvedDataSource {
     }
   
   }
   ```
   
   The generated code has the right behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r992465557


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -60,16 +60,9 @@ message SQL {
   string query = 1;
 }
 
-// Relation that reads from a file / table or other data source. Does not have additional
-// inputs.
-message Read {
-  oneof read_type {
-    NamedTable named_table = 1;

Review Comment:
   @cloud-fan, please see here, the Read had already had a union type and this just needs to be extended with an additional DataSource type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r992522700


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -60,16 +60,9 @@ message SQL {
   string query = 1;
 }
 
-// Relation that reads from a file / table or other data source. Does not have additional
-// inputs.
-message Read {
-  oneof read_type {
-    NamedTable named_table = 1;

Review Comment:
   The benefits of not flattening is about intent. We're defining an intent for "Read"ing data using different alternatives. To a certain degree it's kind of like polymorphism :)
   
   Personally, I prefer the hierarchy because it makes it clear how this should be consumed from an API perspective.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38193:
URL: https://github.com/apache/spark/pull/38193#issuecomment-1273838438

   I think this change is unnecessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r991673158


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   Per the review suggestions on https://github.com/apache/spark/pull/38086, we plan to decouple unresolved relation and data source (e.g. in different messages). cc @cloud-fan 
   
   Data source needs its own plan with its own fields. In that case current `Read` will only be unresolved relation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r991673440


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   I found refactoring in https://github.com/apache/spark/pull/38086 makes that PR too complicated so I split this part out as a separate PR. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r992475710


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -60,16 +60,9 @@ message SQL {
   string query = 1;
 }
 
-// Relation that reads from a file / table or other data source. Does not have additional
-// inputs.
-message Read {
-  oneof read_type {
-    NamedTable named_table = 1;

Review Comment:
   I see, yea we can add a new `DataSource` type here, in addition to the `NamedTable`. This seems like we give them a parent class `Read`. Any benefits of doing so compared to just flattening it? Please bear with me as I'm not very familiar with protobuf...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r992291218


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   @hvanhovell , yes, we are talking about how to describe the scan operation. IIUC there are two options on the table:
   1. Have a single `Read` operation, with many fields: `tableName`, `userSpecifiedSchema`, `dataSourceName`, `options`. If we are scanning a table, only `tableName` is set, if we are scanning a data source, all fields except `tableName` are set.
   2. Have a `UnresolvedRelation` and `UnresolvedDataSource` (we can pick better names, this is just to match catalyst). `UnresolvedRelation` only have a `tableName` field, `UnresolvedDataSource` has many fields: `userSpecifiedSchema`, `dataSourceName`, `options`.
   
   I'm not familiar wth the protobuf design philosophy. Does it prefer option 1 or 2?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r992079265


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   I left a comment in https://github.com/apache/spark/pull/38193#discussion_r991722481 but let me also reply here.
   
   Yes, we want the query plan proto definition to follow the basics for the relational algebra / SQL language. However, data source reading is not really SQL, but more of "configuration". To read a data source in SQL, we need to do
   ```
   CRETE TABLE t USING my_source OPTIONS (...)
   SELECT * FROM t
   ```
   We must register the data source as a table and then read it. We can't read a data source directly in SQL.
   
   DF API provides a way to read a data source directly: `spark.read.format("my_source").option(...).load()`. This means, we can't say `a read is a sink the same way you use it in the from statement`. Reading a data source directly needs a dedicated query plan.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38193:
URL: https://github.com/apache/spark/pull/38193#discussion_r991673158


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   Per the review suggestions on https://github.com/apache/spark/pull/38086, we plan to decouple unresolved relation and data source (e.g. in different messages). cc @cloud-fan 
   
   Data source needs its own plan with its own fields. In that case current `Read` will only be unresolved relation. 
   
   For Spark I don't think Relation and DataSource are ever being the same concept. Relation has an identifier and analyzer needs to resolve it. Data source has format, path, schema and other configurations that analyzer does not need to resolve it from the catalog. IIRC relation and data source does not share same fields.



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -31,7 +31,7 @@ option java_package = "org.apache.spark.connect.proto";
 message Relation {
   RelationCommon common = 1;
   oneof rel_type {
-    Read read = 2;
+    UnresolvedRelation unresolved_relation = 2;

Review Comment:
   Per the review suggestions on https://github.com/apache/spark/pull/38086, we plan to decouple unresolved relation and data source (e.g. in different messages). cc @cloud-fan 
   
   Data source needs its own plan with its own fields. In that case current `Read` will only be unresolved relation. 
   
   For Spark I don't think Relation and DataSource are ever being the same concept. Relation has an identifier and analyzer needs to resolve it. Data source has format, path, schema and other configurations that analyzer does not need to resolve it from the catalog. IIRC relation and data source do not share same fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia closed pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation

Posted by GitBox <gi...@apache.org>.
amaliujia closed pull request #38193: [SPARK-39375][CONNECT][FOLLOW-UP] Refactor Read to UnresolvedRelation
URL: https://github.com/apache/spark/pull/38193


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org