You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "ueshin (via GitHub)" <gi...@apache.org> on 2023/10/25 00:29:06 UTC

[PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

ueshin opened a new pull request, #43519:
URL: https://github.com/apache/spark/pull/43519

   ### What changes were proposed in this pull request?
   
   Fixes observation when named observations with the same name on different datasets.
   
   ### Why are the changes needed?
   
   Currently if there are observations with the same name on different dataset, one of them will be overwritten by the other execution.
   
   For example,
   
   ```py
   >>> observation1 = Observation("named")
   >>> df1 = spark.range(50)
   >>> observed_df1 = df1.observe(observation1, count(lit(1)).alias("cnt"))
   >>>
   >>> observation2 = Observation("named")
   >>> df2 = spark.range(100)
   >>> observed_df2 = df2.observe(observation2, count(lit(1)).alias("cnt"))
   >>>
   >>> observed_df1.collect()
   ...
   >>> observed_df2.collect()
   ...
   >>> observation1.get
   {'cnt': 50}
   >>> observation2.get
   {'cnt': 50}
   ```
   
   `observation2` should return `{'cnt': 100}`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the observations with the same name will be available if they observe different datasets.
   
   ### How was this patch tested?
   
   Added the related tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1822188529

   Oh sorry I was a bit confused as well. I think it's because of self-join, we don't require the observation name to be unique.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1821335854

   name is unique but df can be self-joined and observation will be duplicated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on code in PR #43519:
URL: https://github.com/apache/spark/pull/43519#discussion_r1403202213


##########
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##########
@@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest
     assert(namedObservation.get === expected)
   }
 
+  test("SPARK-45656: named observations with the same name on different datasets") {
+    val namedObservation1 = Observation("named")
+    val df1 = spark.range(50)
+    val observed_df1 = df1.observe(
+      namedObservation1, count(lit(1)).as("count"))

Review Comment:
   There is a rule that enforces that `name` is unique within a plan (except for identical expressions: self-joins).
   
   Could we have some similar check across the spark session?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1824030046

   This whole problem goes away with unnamed `Observation` instances:
   
   ```
   >>> observation1 = Observation()
   >>> observation2 = Observation()
   ```
   
   What is the purpose of the `name` anyway? It is not used anywhere, nor visible anyway. It is merely used in the query plan to give `CollectMetrics` some id.
   
   Historically, the `name` argument was introduced by the `Dataset.observe(str, Column, Column*)` method introduced in 3.0.0, as this `name` was the only identifier required / used to retrieve the metrics from the event listener.
   
   Then, in 3.3.0, I have introduced the `Observation` class to simplify retrieval of the metrics from the events, and added the `name` argument as optional, though there is no value in giving your observation a name.
   
   I'd propose to remove the optional `name` from Observation in 4.0.0, which fixes [SPARK-45656] by reducing complexity, rather than increasing complexity as per this PR.
   
   Arguably, `Dataset.observe(str, Column, Column*)` could be made `private[spark]` in 4.0.0 as well as `Dataset.observe(Observation, Column, Column*)` seems to be more user-friendly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #43519: [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets
URL: https://github.com/apache/spark/pull/43519


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1867801729

   @cloud-fan what do you think about making `Dataset.observe(str, Column, Column*)` private?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1821529980

   > name is unique but df can be self-joined and observation will be duplicated.
   
   That sounds like an unrelated edge case. Example in the description is not a self-join of an observed dataset. And this PR does not fix this edge case: The self-joined dataset still contains two identical `CollectMetrics` with the same `dataframeId`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1778720002

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1821283381

   What is the point of the name of an observation, if that is not the unique identifier? Create an observation without a name if you cannot come up with a unique name and `Observation()` will pick a unique name for you.
   
   Why introducing another identifier when `name` is supposed to be unique?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1867978567

   Next release is a major release, so perfect opportunity to improve API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1824032786

   > Oh sorry I was a bit confused as well. I think it's because of self-join, we don't require the observation name to be unique.
   > 
   > With the new df_id parameter, seems we can now? But it may be a breaking change.
   
   In a self-join, the `df_id` is identical between left and right side of the join.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.

beliefer commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1822712201

   I guess this PR fix a bug that caused by multiple datasets could share with the same spark session. The listener of `Observation` could receives the `SparkListenerSQLExecutionEnd` belongs to other `Observation`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #43519:
URL: https://github.com/apache/spark/pull/43519#discussion_r1403120274


##########
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##########
@@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest
     assert(namedObservation.get === expected)
   }
 
+  test("SPARK-45656: named observations with the same name on different datasets") {
+    val namedObservation1 = Observation("named")
+    val df1 = spark.range(50)
+    val observed_df1 = df1.observe(
+      namedObservation1, count(lit(1)).as("count"))

Review Comment:
   does the `observe` with string input have the same problem? or it's only an issue with `Observation`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "EnricoMi (via GitHub)" <gi...@apache.org>.

EnricoMi commented on code in PR #43519:
URL: https://github.com/apache/spark/pull/43519#discussion_r1403196154


##########
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:
##########
@@ -1024,6 +1024,27 @@ class DatasetSuite extends QueryTest
     assert(namedObservation.get === expected)
   }
 
+  test("SPARK-45656: named observations with the same name on different datasets") {
+    val namedObservation1 = Observation("named")
+    val df1 = spark.range(50)
+    val observed_df1 = df1.observe(
+      namedObservation1, count(lit(1)).as("count"))

Review Comment:
   If `name` in `Dataset.observe(name, expr, exprs)` is not unique across the spark session, then yes, we have the same problem. `Observation(name)` inherits the problem from `Dataset.observe(name, expr, exprs)`. `Observation()` avoids this problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-45656][SQL] Fix observation when named observations with the same name on different datasets [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #43519:
URL: https://github.com/apache/spark/pull/43519#issuecomment-1867814446

   We can not remove (making private is the same as removal for end users) a released API. We can update the document and say Spark always ignore the `name` parameter though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org