Posted to reviews@spark.apache.org by nsuthar <gi...@git.apache.org> on 2014/04/25 04:43:06 UTC

[GitHub] spark pull request: Getting absolute path instead of relative path...

GitHub user nsuthar opened a pull request:

    https://github.com/apache/spark/pull/546

    Getting absolute path instead of relative path.

    Getting absolute path instead of relative path. It's the same bug as SPARK-1527.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nsuthar/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/546.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #546
    
----
commit 6c25d6b92c5564ad455b9d065fc1e983c3ff08ea
Author: nirajguavus <ni...@gmail.com>
Date:   2014-04-23T23:23:23Z

    Fixed bug for Jira Issue# 1527
    Details: rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0, rootDir1

commit 53ed683214ebbe24e8af80a1bcfc35b4685bb73d
Author: nirajsuthar <ni...@gmail.com>
Date:   2014-04-24T18:25:15Z

    Jira 1527
    
    rootDirs in DiskBlockManagerSuite doesn't get full path from rootDir0,
    rootDir1

commit 25b34f47e89a7012a92a8d8e9d1d3ccbfb248c75
Author: nirajsuthar <ni...@gmail.com>
Date:   2014-04-24T19:26:45Z

    Merge branch 'bugFixes'

commit 59e1ad11afdd6bfa167a305169e89289e7f5b413
Author: nirajsuthar <ni...@gmail.com>
Date:   2014-04-25T02:39:08Z

    fixed it to use absolute path instead of relative
    
    Its same issue as Jira 1527
    - fixed it to use absolute path instead of relative

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-42484142
  
    @pwendell @nsuthar  
    
    First, note SPARK-1527 (https://issues.apache.org/jira/browse/SPARK-1527) and its PR (https://github.com/apache/spark/pull/436).
    
    @advancedxy says that was a real issue, and I think it is fixed by the PR, which uses `getAbsolutePath()`. FWIW I agree with that change.
    
    I think that is slightly preferable to the change in this PR. `getCanonicalPath()` is probably just fine too, and also solves the issue, but may be unnecessary.
    
    There is a separate issue addressed in this PR / SPARK-1623. That is that the set `files`, which has `String`s, is populated with the result of `File.getAbsolutePath()` but then later unpopulated with the result of `File.toString()`, which is `File.getPath()`, which is *not* necessarily the same for the same file.
    
    Whether or not it all happens to work out because the arguments are always absolute in practice, I think it's nicer to resolve this by a) calling `files.remove(file.getAbsolutePath)` or b) just making `files` a set of `File` to begin with.
    
    So I suggest:
    
    - commit the PR for SPARK-1527 and close it
    - abandon this PR and make a new one that implements b) or a) above to resolve SPARK-1623
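
    For illustration, option b) might look something like the minimal sketch below, using a plain `mutable.Set[File]` in place of Spark's `TimeStampedHashSet` (a simplification; `getFile` is a hypothetical stand-in for the real helper):
    
    ```
    import java.io.File
    import scala.collection.mutable
    
    // Simplified stand-in for the broadcast-file registry; the real code keeps
    // a TimeStampedHashSet[String] of path strings instead.
    val files = mutable.Set.empty[File]
    
    def getFile(id: Long): File = new File("broadcast_" + id)  // hypothetical helper
    
    // Populate and unpopulate with the same File value, so no path-string
    // normalization (getAbsolutePath vs toString) is needed at all.
    files += getFile(42)
    // ... later, during cleanup ...
    files -= getFile(42)  // matches: File.equals compares the abstract pathname
    ```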


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/546#discussion_r11983521
  
    --- Diff: core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala ---
    @@ -213,7 +213,7 @@ private[spark] object HttpBroadcast extends Logging {
         SparkEnv.get.blockManager.master.removeBroadcast(id, removeFromDriver, blocking)
         if (removeFromDriver) {
           val file = getFile(id)
    -      files.remove(file.toString)
    +      files.remove(file.getCanonicalPath)
    --- End diff --
    
    Hmm, and also note that the file names are added using the value of `getAbsolutePath()`. If its absolute path were not canonicalized, it would not work.
    
    Why not just have `File` objects in the set `files`? They implement the desired `equals()` semantics:
    http://docs.oracle.com/javase/7/docs/api/java/io/File.html#equals(java.lang.Object)
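
    For a concrete (made-up) example of those semantics: equality is a comparison of the abstract pathname, with no I/O and no normalization.
    
    ```
    import java.io.File
    
    // Equal path strings compare equal, regardless of whether the file exists.
    new File("/tmp/broadcast_1") == new File("/tmp/broadcast_1")    // true
    // But "." components and symlinks are not resolved by equals().
    new File("/tmp/./broadcast_1") == new File("/tmp/broadcast_1")  // false
    ```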


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/546


---



[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41355382
  
    Mind changing the name? I created a new JIRA for this:
    https://issues.apache.org/jira/browse/SPARK-1623
    
    Also - the fix as-is doesn't seem to compile.


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41355131
  
    Merged build started. 


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41354350
  
    Can one of the admins verify this patch?


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by nsuthar <gi...@git.apache.org>.
Github user nsuthar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/546#discussion_r11983658
  
    --- Diff: core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala ---
    @@ -213,7 +213,7 @@ private[spark] object HttpBroadcast extends Logging {
         SparkEnv.get.blockManager.master.removeBroadcast(id, removeFromDriver, blocking)
         if (removeFromDriver) {
           val file = getFile(id)
    -      files.remove(file.toString)
    +      files.remove(file.getCanonicalPath)
    --- End diff --
    
    I have tried both of the following :
    
    1.    deleteBroadcastFile(new File(file.toString).getCanonicalPath)
    2.    deleteBroadcastFile(new File((file.toString).getCanonicalPath))
    In both cases the IDE complains!
    
    Niraj
    
    
    On Thu, Apr 24, 2014 at 10:21 PM, Sean Owen <no...@github.com> wrote:
    
    > In core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala:
    >
    > > @@ -213,7 +213,7 @@ private[spark] object HttpBroadcast extends Logging {
    > >      SparkEnv.get.blockManager.master.removeBroadcast(id, removeFromDriver, blocking)
    > >      if (removeFromDriver) {
    > >        val file = getFile(id)
    > > -      files.remove(file.toString)
    > > +      files.remove(file.getCanonicalPath)
    >
    > Hmm and also note that the file names are added using the value of
    > getAbsolutePath(). If its absolute path were not-canonicalized it would
    > not work.
    >
    > Why not just have File objects in the set files? They implement the
    > desired equals() semantics:
    >
    > http://docs.oracle.com/javase/7/docs/api/java/io/File.html#equals(java.lang.Object)
    >
    > —
    > Reply to this email directly or view it on GitHub <https://github.com/apache/spark/pull/546/files#r11983521>
    > .
    >


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by nsuthar <gi...@git.apache.org>.
Github user nsuthar commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41416816
  
    Could someone please verify this patch? Thanks.
    
    Niraj


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by techaddict <gi...@git.apache.org>.
Github user techaddict commented on a diff in the pull request:

    https://github.com/apache/spark/pull/546#discussion_r11982803
  
    --- Diff: core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala ---
    @@ -229,7 +229,7 @@ private[spark] object HttpBroadcast extends Logging {
           val (file, time) = (entry.getKey, entry.getValue)
           if (time < cleanupTime) {
             iterator.remove()
    -        deleteBroadcastFile(new File(file.toString))
    +        deleteBroadcastFile(new File(file.getCanonicalPath))
    --- End diff --
    
    Shouldn't this be `new File(file.toString).getCanonicalPath` ?


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41355242
  
    Merged build finished. 


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41360648
  
    @nsuthar ah I wasn't clear - I meant change the title of this pull request to have SPARK-1623 at the front of the title.


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41503775
  
    Agreed, and that leads to the changes. For example, `HttpBroadcast.scala` adds paths to the set using the string from `getAbsolutePath` (line 173) but removes them with the value of `toString`, which is `getPath`, which is not necessarily the same. (It may happen to be here.) It seems better to be consistent. In `DiskBlockManagerSuite.scala` I also don't see that we have a guarantee that the path is absolute, and it needs to be IIUC, and the OP suggests this doesn't actually work in some cases.
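
    A minimal standalone sketch of that mismatch (not the actual HttpBroadcast code; names here are illustrative):
    
    ```
    import java.io.File
    import scala.collection.mutable
    
    val files = mutable.Set.empty[String]
    val f = new File("broadcast_1")    // a relative path, for illustration
    
    files += f.getAbsolutePath         // stores something like "/current/dir/broadcast_1"
    files.remove(f.toString)           // looks up "broadcast_1": returns false, nothing removed
    files.remove(f.getAbsolutePath)    // true: the key must be produced the same way it was added
    ```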


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41447132
  
    Jenkins, test this please.


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by nsuthar <gi...@git.apache.org>.
Github user nsuthar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/546#discussion_r11984024
  
    --- Diff: core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala ---
    @@ -229,7 +229,7 @@ private[spark] object HttpBroadcast extends Logging {
           val (file, time) = (entry.getKey, entry.getValue)
           if (time < cleanupTime) {
             iterator.remove()
    -        deleteBroadcastFile(new File(file.toString))
    +        deleteBroadcastFile(new File(file.getCanonicalPath))
    --- End diff --
    
    Sure Sean, I think I agree with you, because the hash set entries are Strings:
    
     private val files = new TimeStampedHashSet[String]
    
    and I am changing it to deleteBroadcastFile(new File(file)) as per your
    suggestion.
    
    Niraj
    
    
    On Thu, Apr 24, 2014 at 10:52 PM, Sean Owen <no...@github.com> wrote:
    
    > In core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala:
    >
    > > @@ -229,7 +229,7 @@ private[spark] object HttpBroadcast extends Logging {
    > >        val (file, time) = (entry.getKey, entry.getValue)
    > >        if (time < cleanupTime) {
    > >          iterator.remove()
    > > -        deleteBroadcastFile(new File(file.toString))
    > > +        deleteBroadcastFile(new File(file.getCanonicalPath))
    >
    > (Removed my earlier incorrect comment) @techaddict <https://github.com/techaddict> is correct since the value
    > file is a String. Niraj, your two new versions either create a String argument where a
    > File is needed or call File methods on a String. I think the correct
    > invocation is simply deleteBroadcastFile(new File(file)). Everything else
    > would be superfluous. (And this too would simplify if the set of values
    > contained Files not Strings of paths.)
    >
    > —
    > Reply to this email directly or view it on GitHub <https://github.com/apache/spark/pull/546/files#r11983954>
    > .
    >


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41512227
  
    Just fly-by-commenting here, but the performance cost of doing a few directory traversals is not relevant in this code path. It gets called only once in the lifetime of a broadcast variable. I think the bigger issues are around correctness and consistency...


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by nsuthar <gi...@git.apache.org>.
Github user nsuthar commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41360471
  
    Hi Patrick,
    
    By changing the name, I suppose you meant the "Assignee Name"? I have assigned
    it to myself.
    
    Please correct me if I am wrong.
    Thank you,
    Niraj
    
    
    On Thu, Apr 24, 2014 at 8:13 PM, Patrick Wendell
    <no...@github.com> wrote:
    
    > Mind changing the name? I created a new JIRA for this:
    > https://issues.apache.org/jira/browse/SPARK-1623
    >
    > Also - the fix as-is doesn't seem to compile.
    >
    > —
    > Reply to this email directly or view it on GitHub <https://github.com/apache/spark/pull/546#issuecomment-41355382>
    > .
    >


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by nsuthar <gi...@git.apache.org>.
Github user nsuthar commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41360736
  
    OK, sorry, I misunderstood it. Done.
    
    
    On Thu, Apr 24, 2014 at 10:44 PM, Sean Owen <no...@github.com> wrote:
    
    > @nsuthar <https://github.com/nsuthar> pretty sure he meant the name of
    > the PR, to something like
    > SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting
    > files by name
    > (Note spelling of Canonical ! :) )
    >
    > —
    > Reply to this email directly or view it on GitHub <https://github.com/apache/spark/pull/546#issuecomment-41360674>
    > .
    >


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-42110226
  
    It is not about a few uses here or there: either the Spark codebase as a whole moves to a) canonical paths always, or it always sticks to b) paths relative to the cwd and/or what is returned by File.createTempFile. There is no middle ground IMO.
    (b) is where we are currently, with some bugs as @srowen mentioned. Either we fix those to conform to (b) or move to (a) entirely.
    
    Btw, regarding cost: we already do quite a lot of path manipulation elsewhere (block management, shuffle, etc.), which adds to the cost.
    
    Trying to carefully cordon off different sections of the code to different idioms is just asking for bugs as the codebase evolves. As we are apparently already seeing!


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-42454571
  
    @mridulm @srowen sorry, not totally following this one. Is there an immediate, isolated bug here, or is this a broader issue where we need to add consistency across the code base?
    
    I guess my question is - in what case does the current implementation cause a problem for users?


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by nsuthar <gi...@git.apache.org>.
Github user nsuthar commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41362685
  
    Fixed it to get the absolute path (SPARK-1623).


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/546#discussion_r11983495
  
    --- Diff: core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala ---
    @@ -229,7 +229,7 @@ private[spark] object HttpBroadcast extends Logging {
           val (file, time) = (entry.getKey, entry.getValue)
           if (time < cleanupTime) {
             iterator.remove()
    -        deleteBroadcastFile(new File(file.toString))
    +        deleteBroadcastFile(new File(file.getCanonicalPath))
    --- End diff --
    
    `file.getCanonicalFile` is even more direct. Re-creating a `File` in the original version of the method was redundant; just `file` as an argument would work, and still works.


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41497607
  
    getAbsolutePath() would also resolve both issues as far as I can tell,
    and does not involve I/O. IIRC there was a real issue observed in
    DiskBlockManagerSuite.
    
    For HttpBroadcast, see how the file names are retrieved with
    getAbsolutePath(), so this is a real issue albeit probably theoretical
    right now.  (That one's also resolved by just using a Set of Files.)
    
    (Is the I/O going to matter though? These are executed once.)
    
    On Sun, Apr 27, 2014 at 3:06 PM, Mridul Muralidharan
    <no...@github.com> wrote:
    > Are you actually seeing problems or is this a cleanup exercise to use
    > appropriate api ?
    > Creation of the file happens from within spark and is not externally
    > provided - and that should match how it gets used.
    >
    > If we are not seeing any actual issues, I would rather not go down the path
    > of fixing this.
    > getCanonicalPath, etc are expensive operations reqiuring filesystem IO to
    > resolve.
    >
    > —
    > Reply to this email directly or view it on GitHub.


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/546#discussion_r11983954
  
    --- Diff: core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala ---
    @@ -229,7 +229,7 @@ private[spark] object HttpBroadcast extends Logging {
           val (file, time) = (entry.getKey, entry.getValue)
           if (time < cleanupTime) {
             iterator.remove()
    -        deleteBroadcastFile(new File(file.toString))
    +        deleteBroadcastFile(new File(file.getCanonicalPath))
    --- End diff --
    
    (Removed my earlier incorrect comment) @techaddict is correct since the value `file` is a `String`. Niraj, your two new versions either create a `String` argument where a `File` is needed or call `File` methods on a `String`. I think the correct invocation is simply `deleteBroadcastFile(new File(file))`. Everything else would be superfluous. (And this too would simplify if the set of values contained `File`s, not `String`s of paths.)
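
    Spelled out as a sketch of the types involved (with `deleteBroadcastFile` stubbed out here; in HttpBroadcast it takes a `File` argument, per the discussion above):
    
    ```
    import java.io.File
    
    def deleteBroadcastFile(file: File): Unit = ()   // stub for illustration
    
    val file: String = "/tmp/spark/broadcast_1"      // entries in `files` are path Strings
    deleteBroadcastFile(new File(file))              // one String -> File conversion; nothing else needed
    ```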



---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41355243
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14472/


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41503498
  
    It goes back to the problem we are trying to solve.
    If the set/map can contain arbitrary paths, then file.getCanonicalPath is unavoidable, but then the (multiple) I/O and native-call overhead will be there.
    
    If it is in the context of a specific use case, then it becomes a function of that: do we need canonical names, or are the paths always constructed using common patterns that are consistent (always relative to the cwd, or specified w.r.t. some root, which can be a symlink)?
    IMO Spark does the latter, unless there are recent changes I am missing.
    
    In any case, getAbsolutePath/File does not buy us much.


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41497440
  
    Are you actually seeing problems, or is this a cleanup exercise to use the appropriate API?
    Creation of the file happens from within Spark and is not externally provided, and that should match how it gets used.
    
    If we are not seeing any actual issues, I would rather not go down the path of fixing this.
    getCanonicalPath, etc. are expensive operations requiring filesystem I/O to resolve.


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41355128
  
     Merged build triggered. 


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by advancedxy <gi...@git.apache.org>.
Github user advancedxy commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-42525924
  
    Sorry for being late for this conversation. I am busy with my own project.
    
    First, I agree with @srowen. SPARK-1527 was discovered when there were untracked files in my Spark git dir after running tests. In DiskBlockManagerSuite, files were created using file.getName as the parent folders, which is a path relative to the cwd. The cleanup code deletes the original temp dirs.
    ```
    val rootDir0 = Files.createTempDir()
    rootDir0.deleteOnExit()
    val rootDir1 = Files.createTempDir()
    rootDir1.deleteOnExit()
    val rootDirs = rootDir0.getName + "," + rootDir1.getName
    ```
    
    This is a problem, or rather a bug I would say, because it doesn't do what it intends to do.
    
    In SPARK-1527, srowen suggested that ```toString``` could return a relative path and that we should use getAbsolutePath instead. After thinking twice, I think it would be OK to use ```toString``` if we make sure we don't change the working directory between file creation and deletion. That is to say, even if tempDir.toString is a relative path, we can still delete the file using file.toString as the file path.
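
    For reference, a minimal sketch of the getAbsolutePath variant srowen suggested there (same setup as the snippet above, i.e. Guava's Files.createTempDir; the actual PR may differ in details):
    
    ```
    val rootDir0 = Files.createTempDir()
    rootDir0.deleteOnExit()
    val rootDir1 = Files.createTempDir()
    rootDir1.deleteOnExit()
    // Pass the full paths rather than just the directory names, so the suite
    // never ends up creating directories relative to the cwd.
    val rootDirs = rootDir0.getAbsolutePath + "," + rootDir1.getAbsolutePath
    ```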
    
    Back in this pr, as @srowen said, 
    >the set files, which has Strings, is populated with the result of File.getAbsolutePath() but then later unpopulated with the result of File.toString(), which is File.getPath(), which is not necessarily the same for the same file.
    
    There is a problem: ```File.getAbsolutePath``` is not necessarily the same as ```File.getPath```, or even the same as ```File.getCanonicalPath```. See the code below:
    ```
    scala> import java.io.File
    import java.io.File
    
    scala> val fstr = "./1.txt"
    fstr: String = ./1.txt
    
    scala> val f = new File(fstr)
    f: java.io.File = ./1.txt
    
    scala> f.getPath
    res0: String = ./1.txt
    
    scala> f.getAbsolutePath
    res1: String = /Users/yexianjin/./1.txt
    
    scala> f.getCanonicalPath
    res2: String = /Users/yexianjin/1.txt
    ```
    
    @mridulm I think you are right: we are at case b), paths relative to the cwd and/or what is returned by Files.createTempFile, and I think we should stick to this case. But we need to review the codebase to make sure the path manipulations are correct.
    
    So I suggest (like @srowen said):
    1. Choose ```getPath``` or ```getAbsolutePath```. ```getAbsolutePath``` can deal with the changing-working-directory situation, but ```getPath``` or ```toString``` is more consistent with the Spark codebase.
    2. The PR for SPARK-1527 could be committed if we use ```getAbsolutePath```; otherwise, we can abandon the PR for SPARK-1527 and this PR.
    3. Make a new one that implements b) or something similar, based on our choice.
    
    What do you think?


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41355113
  
    Jenkins, test this please.


---

[GitHub] spark pull request: Getting absolute path instead of relative path...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-41360674
  
    @nsuthar pretty sure he meant the name of the PR, to something like
    `SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name`
    (Note spelling of Canonical ! :) )


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-42456863
  
    Just my 2 bits, but using getCanonicalPath() is usually a code smell; unless it's explicitly necessary (e.g. you know you have a symlink and you want to remove the actual file it references), it's generally better to use getAbsolutePath().
    
    (Not that I think this applies here, but just commenting on the choice of API.)
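
    A concrete (hypothetical) illustration of the difference, assuming /data is a symlink to /mnt/disk1/data:
    
    ```
    import java.io.File
    
    val f = new File("/data/tmp/../broadcast_1")
    f.getAbsolutePath   // "/data/tmp/../broadcast_1" -- pure string work, no filesystem access
    f.getCanonicalPath  // "/mnt/disk1/data/broadcast_1" -- resolves ".." and the symlink, hitting the filesystem
    ```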


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/546#discussion_r12029270
  
    --- Diff: core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala ---
    @@ -229,7 +229,7 @@ private[spark] object HttpBroadcast extends Logging {
           val (file, time) = (entry.getKey, entry.getValue)
           if (time < cleanupTime) {
             iterator.remove()
    -        deleteBroadcastFile(new File(file.toString))
    +        deleteBroadcastFile(new File(file))
    --- End diff --
    
    not required to wrap it in another File.


---

[GitHub] spark pull request: SPARK-1623. Broadcast cleaner should use getCa...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/546#issuecomment-50538277
  
    Since we merged https://github.com/apache/spark/pull/749, I believe it's okay to close this, right @pwendell?


---