You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/12 04:04:43 UTC
[GitHub] [spark] LuciferYang opened a new pull request, #38214: [SPARK-40765][SQL] Optimize redundant fs operations in `CommandUtils#calculateSingleLocationSize#getPathSize` method
LuciferYang opened a new pull request, #38214:
URL: https://github.com/apache/spark/pull/38214
### What changes were proposed in this pull request?
This pr change the 2nd input parameter from `Path` to `FileStatus to avoid redundant `fs.getFileStatus(path)` in each recursive call.
### Why are the changes needed?
Reduce one dfs operation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass Github Actions
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276963870
Ignore the previous
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276312561
I see so it saves one call, not one call per recursion?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276300373
<img width="648" alt="image" src="https://user-images.githubusercontent.com/1475305/195373375-9b721841-fc21-45f1-b88a-85b499c01237.png">
https://github.com/apache/spark/blob/0dd78341e64a7e146413156834448707f16bba83/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala#L116-L132
Line 122 It is a recursive call. The original logic is pass in `status.getPath`, and then execute `val fileStatus = fs.getFileStatus(status.getPath)` (Line 117).
After change, the recursive call reuse the incoming status.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276311354
> I don't see the same status retrieved twice in the original
Before recursion, there is an exist `status` instance, then the recursive call will call `getStatus` for the exist `status`.`getPath`, Isn't this an unnecessary call?
If there is no recursion, there is no saving.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1275973420
Yes, one `fs.getFileStatus(path)` operation is saved for each recursive call
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276950904
Thanks @srowen @HyukjinKwon and @zhengruifeng
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276354759
For example , the data path is `/a/b/c/d.txt`:
Before
- recursion-0: pass in `Path(/a)` -> ` fs.getFileStatus(Path(/a))` -> `fs.listStatus(Status(/a))` -> `Status(/a/b)` -> call `getPathSize(Status(/a/b).getPath)`
- recursion-1: pass in `Path(/a/b)` -> `fs.getFileStatus(Path(/a/b))` -> `fs.listStatus(Status(/a/b))` -> `Status(/a/b/c)` -> call `getPathSize(Status(/a/b/c).getPath)`
- recursion-2: pass in `Path(/a/b/c)` -> `fs.getFileStatus(Path(/a/b/c))` -> `fs.listStatus(Status(/a/b/c))` -> `Status(/a/b/c/d.txt)` -> call `getPathSize(Status(/a/b/c/d.txt).getPath)`
- recursion-3: pass in path of `Path(/a/b/c/d.txt)` -> `fs.getFileStatus(Path(/a/b/c/d.txt))` -> `Status(/a/b/c/d.txt)` -> return `size`
After
fs.getFileStatus(Path(/a))
- recursion-0: pass in `Status(/a)` -> `fs.listStatus(Status(/a))` -> `Status(/a/b)` -> call `getPathSize(Status(/a/b))`
- recursion-1: pass in `Status(/a/b)` -> `fs.listStatus(Status(/a/b))` -> `Status(/a/b/c)` -> call `getPathSize(Status(/a/b/c))`
- recursion-2: pass in `Status(/a/b/c)` -> `fs.listStatus(Status(/a/b/c))` -> > `Status(/a/b/c/d.txt)` -> call `getPathSize(Status(/a/b/c/d.txt))`
- recursion-3: pass in `Status(/a/b/c/d.txt)` -> return `size`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276474003
Oh yes I think that's it. The other call is not in the same recursive loop. Ok I get it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276472661
<img width="615" alt="image" src="https://user-images.githubusercontent.com/1475305/195402745-99aa2d9f-1e68-4ba9-a574-02e40b5190d9.png">
Is it because of code folding? After change, only getFileStatus will be made for `/a`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276963399
I need to know more about the rules for defining error-class, so close first
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276389487
From the example:
- `fs.getFileStatus(Path(/a))` not saved, but no additional calls are added
- `fs.getFileStatus(Path(/a/b))`,`fs.getFileStatus(Path(/a/b/c))` and `fs.getFileStatus(Path(/a/b/c/d.txt))` saved
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1275918414
so it can save some IO operations?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
URL: https://github.com/apache/spark/pull/38214
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276475810
> Oh yes I think that's it. The other call is not in the same recursive loop. Ok I get it
Yes, and I finally know why you had doubts before :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276359784
@srowen Is the above example helpful?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276317638
The saved times are related to the directory level and the number of directories in each level
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276320786
I may have to think about it more, but still don't see how this saves much as you have _added_ a new call to getFileStatus too. Right? how can it save all the calls but one?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276468566
OK, I'm still not seeing how they are saved, because in the recursive call, getFileStatus is called. After, you are still calling this on /a/b, /a/b/c, etc. Right? That is line 138
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276938912
Merged to master.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1275557093
cc @HyukjinKwon @zhengruifeng
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276288007
I don't quite see how this saves a call; the caller makes an fs call instead of the callee, but the number of calls seems like it must be the same?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] srowen commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276303206
But the caller had to add a new call to get the status. It seems to just be moved around. I don't see the same status retrieved twice in the original, so wondering how we can actually avoid a call here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #38214: [SPARK-40765][SQL] Optimize redundant fs operation in `CommandUtils#calculateSingleLocationSize#getPathSize` method
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38214:
URL: https://github.com/apache/spark/pull/38214#issuecomment-1276315937
Every time a recursive call occurs, it will be saved once (the initial call is not counted, and the initial call is not saved)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org