Posted to reviews@spark.apache.org by "erenavsarogullari (via GitHub)" <gi...@apache.org> on 2024/01/09 23:19:48 UTC

[PR] [SPARK-46639][SQL] Add WindowExec SQLMetrics [spark]

erenavsarogullari opened a new pull request, #44646:
URL: https://github.com/apache/spark/pull/44646

   ### What changes were proposed in this pull request?
   Currently, the WindowExec physical operator does not have any SQLMetrics. This PR aims to add WindowExec SQLMetrics to provide the following information about WindowExec usage during query execution:
   ```
   numOfOutputRows: Total number of output rows.
   numOfPartitions: Number of processed input partitions.
   numOfWindowPartitions: Number of generated window partitions.
   spilledRows: Total number of spilled rows.
   spillSizeOnDisk: Total size of spilled data on disk.
   ```
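   A minimal sketch of how such metrics are typically declared with Spark's `SQLMetrics` factory methods is shown below; the helper name, metric keys, and descriptions are illustrative assumptions, not necessarily the exact code in this PR:
   ```scala
   import org.apache.spark.SparkContext
   import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}

   // Hypothetical helper for illustration only: in the real operator this map
   // would be the value of the overridden `metrics` member of WindowExec.
   def windowMetrics(sc: SparkContext): Map[String, SQLMetric] = Map(
     "numOfOutputRows"       -> SQLMetrics.createMetric(sc, "number of output rows"),
     "numOfPartitions"       -> SQLMetrics.createMetric(sc, "number of input partitions"),
     "numOfWindowPartitions" -> SQLMetrics.createMetric(sc, "number of window partitions"),
     "spilledRows"           -> SQLMetrics.createMetric(sc, "number of spilled rows"),
     "spillSizeOnDisk"       -> SQLMetrics.createSizeMetric(sc, "spill size on disk")
   )
   ```
   Metrics registered this way surface next to the operator node in the SQL tab of the Spark UI.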
   ### Why are the changes needed?
   These SQLMetrics can help Spark users better understand the runtime behavior of window functions.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   4 new unit tests have been added.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No



Re: [PR] [SPARK-46639][SQL] Add WindowExec SQLMetrics [spark]

Posted by "erenavsarogullari (via GitHub)" <gi...@apache.org>.
erenavsarogullari commented on PR #44646:
URL: https://github.com/apache/spark/pull/44646#issuecomment-1965093597

   Hi @cloud-fan and @beliefer,
   Is it possible to review this PR when you have a chance?



Re: [PR] [SPARK-46639][SQL] Add WindowExec SQLMetrics [spark]

Posted by "erenavsarogullari (via GitHub)" <gi...@apache.org>.
erenavsarogullari commented on PR #44646:
URL: https://github.com/apache/spark/pull/44646#issuecomment-1892961944

   > Adding statistical information is good, but I'm not sure if it's worth it.
   
   Currently, the `WindowExec` physical operator does not have any SQLMetrics except `spillSize` (in-memory). This PR aims to make WindowExec runtime behavior observable, for example:
   - how many window partitions (`numOfWindowPartitions`) are created in addition to input partitions (`numOfPartitions`)?
   - how many output rows (`numOfOutputRows`) are produced?
   - how many rows (`spilledRows`) are spilled to disk?
   - what is the spilled data size on disk (`spillSizeOnDisk`)?
   
   For example, WindowExec spilling behavior depends on multiple factors and is hard to track without metrics (see the sketch after these steps):
   
   **1-** WindowExec creates an `ExternalAppendOnlyUnsafeRowArray` (backed by an internal ArrayBuffer) per `task` (i.e., per child RDD partition).
   **2-** When the ExternalAppendOnlyUnsafeRowArray size exceeds `spark.sql.windowExec.buffer.in.memory.threshold` (default: 4096), it switches to an `UnsafeExternalSorter` as its `spillableArray` by moving all buffered rows into the UnsafeExternalSorter and clearing the internal ArrayBuffer. From this point on, WindowExec writes into the UnsafeExternalSorter's buffer (a.k.a. `UnsafeInMemorySorter`).
   **3-** An UnsafeExternalSorter is created per `window partition`. When its buffer size exceeds `spark.sql.windowExec.buffer.spill.threshold` (default: Integer.MAX_VALUE), it spills to disk and clears all buffered (UnsafeInMemorySorter) content, then continues to buffer subsequent records until `spark.sql.windowExec.buffer.spill.threshold` is exceeded again.
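   To make this concrete, below is a minimal, hedged sketch (the threshold values, the toy query, and the local SparkSession setup are illustrative assumptions) of how one could force this spill path and inspect whatever SQLMetrics the executed WindowExec node exposes:
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.execution.window.WindowExec
   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions._

   val spark = SparkSession.builder().master("local[2]").appName("window-metrics-demo").getOrCreate()
   import spark.implicits._

   // AQE is disabled only so the executed plan is easy to traverse in this demo.
   spark.conf.set("spark.sql.adaptive.enabled", "false")
   // Lower both buffer thresholds (illustrative values) so even a small dataset follows the
   // ExternalAppendOnlyUnsafeRowArray -> UnsafeExternalSorter -> disk spill path described above.
   spark.conf.set("spark.sql.windowExec.buffer.in.memory.threshold", "16")
   spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", "16")

   val df = (1 to 10000).map(i => (i % 10, i)).toDF("key", "value")
   val ranked = df.withColumn("rank", row_number().over(Window.partitionBy("key").orderBy("value")))
   ranked.collect()  // trigger execution so the metrics are populated

   // Print whatever metrics the executed WindowExec node exposes.
   ranked.queryExecution.executedPlan.collect { case w: WindowExec => w.metrics }
     .foreach(_.foreach { case (name, metric) => println(s"$name = ${metric.value}") })
   ```
   With the proposed metrics in place, the spill behavior described in steps 1-3 would also be directly visible per WindowExec node in the SQL UI.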



Re: [PR] [SPARK-46639][SQL] Add WindowExec SQLMetrics [spark]

Posted by "erenavsarogullari (via GitHub)" <gi...@apache.org>.
erenavsarogullari commented on PR #44646:
URL: https://github.com/apache/spark/pull/44646#issuecomment-1888407610

   Hi @cloud-fan and @dongjoon-hyun,
   Hope you are doing well.
   I would like to get your feedback on this PR whenever you have time. Thanks in advance.

