Posted to reviews@spark.apache.org by zheh12 <gi...@git.apache.org> on 2018/05/10 02:15:18 UTC

[GitHub] spark pull request #21286: [SPARK-24194] HadoopFsRelation cannot overwrite a...

GitHub user zheh12 opened a pull request:

    https://github.com/apache/spark/pull/21286

    [SPARK-24194] HadoopFsRelation cannot overwrite a path that is also b…

    ## What changes were proposed in this pull request?
    
    When multiple jobs append to the same `HadoopFsRelation` at the same time, two kinds of error show up:
    
    1. One job will succeed, but its data will be wrong: more data than expected will appear.
    2. The other jobs will fail with `java.io.FileNotFoundException: Failed to get file status skip_dir/_temporary/0`
    
    The main reason for this problem is that multiple jobs use the same `_temporary` directory.
    
    So the core idea of this PR is to create a separate temporary directory, keyed by jobId, for each job inside the output folder, so that conflicts are avoided.
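
    For illustration only (directory names and job ids here are hypothetical), the intended layout is roughly:
    
    ```
    skip_dir/tab1/.temp-jobA/_temporary/0/...   <- job A writes and commits here
    skip_dir/tab1/.temp-jobB/_temporary/0/...   <- job B writes and commits here
    ```
    
    so concurrent jobs never share a `_temporary` tree, and each job moves its own files into `skip_dir/tab1/` on commit.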
    
    ## How was this patch tested?
    
    I tested manually, but I don't know how to write a unit test for this situation. Please advise.
    
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zheh12/spark SPARK-24238

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21286.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21286
    
----
commit b676a36af110b0b7d7dfc47ab292d09c441f6a0f
Author: yangz <zh...@...>
Date:   2018-05-10T01:46:49Z

    [SPARK-24194] HadoopFsRelation cannot overwrite a path that is also being read from

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    ...this makes me think that the FileOutputCommitter actually has an assumption that nobody has called out before, specifically "only one application will be writing data to the target FS with the same job id". It's probably been implicit in MR with a local HDFS for a long time, first on the assumption of all jobs getting unique job IDs from the same central source *and* nothing outside the cluster writing to the same destinations. With cloud stores, that doesn't hold; it's conceivable that more than one YARN cluster could start jobs with the same dest. As the timestamp of YARN launch is used as the initial part of the identifier, if more than one cluster was launched in the same minute, things are lined up to collide. Oops.
    
    FWIW, the parsing code I mentioned is `org.apache.hadoop.mapreduce.JobID.forName()`: any numbering scheme Spark uses should be able to map from a string to a job ID through that & back again.
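
    A minimal sketch of that round trip (the identifier value is made up):
    
    ```scala
    import org.apache.hadoop.mapreduce.JobID
    
    // A Spark-generated id only survives the MR layer if it parses as
    // job_<identifier>_<number>; JobID.forName() is the parser in question.
    val id = new JobID("20180510024615", 1)        // renders as "job_20180510024615_0001"
    val parsed = JobID.forName(id.toString)        // parse the string form back
    assert(parsed == id)                           // round trip must be lossless
    ```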


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    @jinxing64 from my reading of the code, the original patch proposed creating a temp dir for every query, which could then do its own work & cleanup in parallel, with a new meta-commit on each job commit, moving stuff from this per-job temp dir into the final dest. 
    
    This is to address
    * conflict of work in the `_temporary/0` path
    * rm of `_temporary` in job abort, post-commit cleanup
    
    And the reason for that '0' is that Spark's job id is just a counter of queries done since app start, whereas on Hadoop MR it's unique across a live YARN cluster. Spark deploys in different ways, and can't rely on that value.
    
    The job id discussion proposes generating unique job IDs for every Spark app, allowing `_temporary/$jobID1` to work alongside `_temporary/$jobID2`. With that *and* disabling cleanup in the FileOutputCommitter (`mapreduce.fileoutputcommitter.cleanup.skipped`), @zheh12 should get what they need: parallel queries to the same dest using FileOutputCommitter without conflict of temp data.
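
    A minimal sketch of the cleanup-skip half of that, assuming the MAPREDUCE-6478 property is available in the Hadoop version on the classpath (the table path and dataset are made up):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().appName("parallel-append").getOrCreate()
    
    // Stop FileOutputCommitter from deleting the whole $dest/_temporary tree on
    // job commit, so a concurrent job's in-flight temp data is left alone.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.cleanup.skipped", "true")
    
    // Hypothetical concurrent append; the unique-job-id part is a separate Spark change.
    spark.range(10).write.mode("append").parquet("skip_dir/tab1")
    ```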
    
    > Thus the change is outside the committer and doesn't break the committer's logic. Did I understand correctly?
    
    Exactly. It also makes it a simpler change, which is good, as the commit algorithms are pretty complex and it's hard to test all the failure modes.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    **[Test build #90827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90827/testReport)** for PR 21286 at commit [`49532fe`](https://github.com/apache/spark/commit/49532fe871278473dd18807efdc7fac68e0c0b26).


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    relates to #21257 


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    It seems like we somehow set the job id wrong, which caused different jobs to share the same working directory. I don't believe HDFS has this issue by design. Can you look into why jobs share the same working directory in Spark?


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    **[Test build #90827 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90827/testReport)** for PR 21286 at commit [`49532fe`](https://github.com/apache/spark/commit/49532fe871278473dd18807efdc7fac68e0c0b26).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    cc @cloud-fan @jiangxb1987
    Are there any drawbacks to this idea? Please give some advice when you have time.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    cc @steveloughran who I believe is the expert in this area.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    > After the job is committed, `skip_dir/tab1/_temporary` will be deleted. Then when other jobs attempt to commit, an error will be reported.
    
    I see. Yes, that's `org.apache.hadoop.mapreduce.OutputCommitter.cleanupJob()` doing the work. It does this because it wants to clean up all attempts, including predecessors which have failed, and it expects only one job to be writing at a time.
    
    Like I said, this proposed patch breaks all the blobstore-specific committer work, causes problems at scale with HDFS alone, and adds a new problem: how do you clean up from failed jobs writing to the same destination? 
    
    It's causing these problems because it's using another layer of temp dir and then the rename.
    
    Assuming you only want to work with `org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter` and subclasses thereof (like the Parquet one), why not
    
    1. Pick up my SPARK-23977 patch and Hadoop 3.1. There are some problems with hive versioning there, but that is a WiP of mine.
    1. make your own subclass of `FileOutputCommitter` whose `cleanupJob()` method doesn't do that full `$dest/_temporary` dir cleanup, just deletes the current job ID's subdir (see the sketch below)
    1. Configure the job's (new) committer factory underneath the FileOutputFormat to return your committer; do the same for Parquet via the `BindingParquetOutputCommitter`. 
    
    That way, you get to choose cleanup policy, don't create conflict, don't need to rename things. 
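
    A rough sketch of option 2, assuming the `FileOutputCommitter` API from Hadoop 2.x/3.x (the class name is made up):
    
    ```scala
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    
    // Restrict cleanup to this job's own attempt dir instead of deleting the whole
    // $dest/_temporary tree that other concurrent jobs may still be writing into.
    class PerJobCleanupCommitter(outputPath: Path, context: JobContext)
        extends FileOutputCommitter(outputPath, context) {
    
      override def cleanupJob(context: JobContext): Unit = {
        val fs = outputPath.getFileSystem(context.getConfiguration)
        // getJobAttemptPath(context) is $dest/_temporary/$appAttemptId for this job
        fs.delete(getJobAttemptPath(context), true)
      }
    }
    ```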
    
    There's also the option of providing a MAPREDUCE- patch to add a switch to change cleanup to only purge that job's data... you'd need to make sure all attempts of that job get cleaned up, as MR can make multiple attempts. There's a general fear of going near that class as it's such a critical piece of code, but cleanup is not the bit everyone is scared of. Get a change in there and all the FileOutputCommitter subclasses get it. That'd be for Hadoop 3.2 & 2.10; no need to change anything in Spark other than the job ID problem.
    
    
    > Meanwhile, because all applications share the same app attempt id, they write temporary data to the same temporary dir `skip_dir/tab1/_temporary/0`. Data committed by the successful application is also corrupted.
    
    That's the issue we've been discussing related to job IDs. If each Spark driver comes up with a unique job ID, that conflict will go away.



---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    I think the Hadoop design does not allow two jobs to share the same output folder.
    
    Hadoop has a related patch that can partially solve this problem: you can configure a parameter so that the `_temporary` directory is not cleaned up. But I think this is not a good solution.
    
    [MAPREDUCE-6478. Add an option to skip cleanupJob stage or ignore cleanup failure during commitJob.](https://issues.apache.org/jira/browse/MAPREDUCE-6478?attachmentSortBy=fileName) 
    
    For this problem, we'd better use different temporary output directories for different jobs, and then copy the files.
    
    However, the current implementation breaks some unit tests. There are two ways to fix it.
    
    1. Add a check for the presence of the tempDir in `HadoopMapReduceCommitProtocol.commitJob`, but this requires callers to set `FileOutputFormat.setOutputPath(job, s".temp-${committer.getJobId()}")` externally.
    
    2. Another approach is to enable the tempDir for every `HadoopMapReduceCommitProtocol`. That hides the tempDir setup from callers, but every job then pays for one extra round of file moves.
    
    cc @cloud-fan. Which do you think is better? Please give me some advice.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Thanks @steveloughran for your deep explanation!
    
    Spark does have a unique job id, but it's only unique within a SparkContext; we may have 2 different Spark applications writing to the same directory. I think timestamp+uuid should be good enough as a job id. Spark doesn't retry jobs, so we can always set the job attempt id to 0.
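
    A minimal sketch of such an id, purely illustrative (the helper name is made up):
    
    ```scala
    import java.time.ZonedDateTime
    import java.time.ZoneOffset
    import java.time.format.DateTimeFormatter
    import java.util.UUID
    
    // Unique across applications, not just within one SparkContext; timestamp
    // first so a listing of temp dirs still sorts roughly by start time.
    def newJobId(): String = {
      val ts = ZonedDateTime.now(ZoneOffset.UTC)
        .format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"))
      s"$ts-${UUID.randomUUID()}"
    }
    ```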


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Cool, so do I understand correctly that there are only two things to do:
    1. create a unique jobID, perhaps `timestamp+uuid`?
    2. disable cleanup in FileOutputCommitter via `mapreduce.fileoutputcommitter.cleanup.skipped`?


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    > cc @steveloughran who I believe is the expert in this area.
    
    I suppose "Stepped through the FileOutputCommit operations with a debugger and a pen and paper" counts, given the complexity there. There's still a lot of corner cases which I'm not 100% sure on (or confident that the expectations of the job coordinators are met). Otherwise its more folklore "we did this because of job A failed with error...", plus some experiments with fault injection. I'd point to @rdblue as having put in this work too.
    
    * Hadoop MR uses the jobID for unique temp paths; it comes from YARN and is guaranteed to be unique within the cluster, at least until everything is restarted. See [Hadoop committer architecture](http://hadoop.apache.org/docs/r3.1.0/hadoop-aws/tools/hadoop-aws/committer_architecture.html)
    * And to handle job restart, has a temp ID too.
    
    Using a temp dir and then renaming in is roughly what the FileOutputCommitter v1 algorithm does:
    
    1. task commit:  `_temporary/$jobAttemptId/_temporary/$taskID_$taskAttemptID` -> `_temporary/$jobAttemptId/$taskID`
    2. Job commit: list `_temporary/$jobAttemptId`, move over. This is sequential renaming, slow on very large jobs on HDFS &c, where it's O(files), and a performance killer on any object store where it's O(data)
    3. The "v2" algorithm avoids this job commit overhead by incrementally committing tasks as they complete, so breaking fundamental assumptions about observability of output and the ability to recover from failure of tasks and jobs.
    
    
    Adding an extra directory with another rename has some serious issues
    
    * Completely breaks all the work Ryan and I have done with committers which PUT directly into place in S3, where "place" can include specific partitions with specific conflict resolution
    * Adds *another* O(files) or O(data) rename process. So doubles the commit time of V1, and for v2 restores the v1 commit overhead, while at least fixing the task commit semantics. Essentially: it reinstates v1, just less efficiently.
    * Still has the problem of how to handle failure in object stores (S3, GCS) which don't do atomic directory rename.
    
    Which is why I think it's the wrong solution.
    
    Normally Spark rejects work to the destination if it's already there, so only one job will have a temp dir. This conflict will only be an issue if overwrite is allowed, which is going to have other adverse consequences if files with the same name are ever created. If the two jobs commit simultaneously, you'll get a mixture of results. This is partly why the S3A committers insert UUIDs into their filenames by default, the other reason being S3's lack of update consistency.
    
    Ignoring that little issue, @cloud-fan is right: giving jobs a unique ID should be enough to ensure that FileOutputCommitter does all its work in isolation.
    
    Any ID known to be unique to all work actively (or potentially) able to write to the same dest dir. Hadoop MR has a strict ordering requirement so that it can attempt to recover from job attempt failures (it looks under the previous attempt's `_temporary/$jobAttemptId/` dir to find committed work from that attempt). Spark should be able to just create a UUID.
    
    @zheh12: welcome to the world of distributed commit protocols. My writeup is [here](https://github.com/steveloughran/zero-rename-committer/releases/download/tag_draft_003/a_zero_rename_committer.pdf). Also check out Gil Vernik's [Stocator paper](https://arxiv.org/abs/1709.01812). Start with those and the source, and assume we've all made mistakes...
    
    Finally, regarding the MAPREDUCE cleanup JIRAs, the most recent is [MAPREDUCE-7029](https://issues.apache.org/jira/browse/MAPREDUCE-7029). That includes comments from the Google team on their store's behaviour.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Does Spark have a jobID in the writing path? The path below is an example from my debugging log:
    ```
    parquettest2/_temporary/0/_temporary/attempt_20180515215310_0000_m_000000_0/part-00000-9104445e-e54a-4e3f-9ba4-e624d60e6247-c000.snappy.parquet
    ```
    `parquettest2` is a non-partitioned table. It seems that the `jobAttemptId` in `_temporary/$jobAttemptId/_temporary/$taskID_$taskAttemptID` is always 0 if there is no retry.
    If no unique jobID is provided in the writing path, think about the scenario below:
    ```
    1. JobA starts and writes data to dir/tab/_temporary/0/_temporary/$taskID_$taskAttemptID
    2. JobB starts and writes data to dir/tab/_temporary/0/_temporary/$taskID_$taskAttemptID
    3. Note that JobA and JobB write data to dir/tab/_temporary/0/_temporary at the same time
    4. When JobA commits, all data under dir/tab/_temporary/0/ is committed as the output -- yes, it's a mixture from both JobA and JobB, so the data written to the target table is incorrect.
    5. When JobA commits and cleans up, dir/tab/_temporary/ will be deleted. But at this moment JobB is not finished yet; it cannot find dir/tab/_temporary/0/ and fails.
    ```
    If I understand correctly, this PR proposes to add a jobId outside the _temporary, and the writing path format is like below:
    `$jobID/_temporary/$jobAttemptId/_temporary/$taskID_$taskAttemptID`.
    Thus the change is outside the committer and doesn't break the committer's logic.
    Did I understand correctly?


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    ok to test


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    **[Test build #90463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90463/testReport)** for PR 21286 at commit [`b676a36`](https://github.com/apache/spark/commit/b676a36af110b0b7d7dfc47ab292d09c441f6a0f).


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    @jinxing64 yes, with the detail that the way some bits of Hadoop parse a job attempt, they like it to be an integer. Some random number used as the upper digits of a counter could work; it'd still give meaningful job IDs like "45630001" for the first job and "45630002" for the second, for the process which came up with "4563" as its prefix. Yes, eventually it'll wrap, but that's integers for you.
    
    BTW, the `newFileAbsPath` code creates the staging dir `".spark-staging-" + jobId`. Again, a jobID unique across all processes is enough.
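
    A sketch of that numbering scheme, purely illustrative (object and method names are made up):
    
    ```scala
    import java.util.concurrent.atomic.AtomicInteger
    import scala.util.Random
    
    // One random 4-digit prefix per process, plus a per-process counter, kept
    // numeric so Hadoop code that parses ids into integers still copes.
    object NumericJobIds {
      private val prefix = 1000 + Random.nextInt(9000)   // e.g. 4563
      private val counter = new AtomicInteger(0)
    
      def next(): Int = prefix * 10000 + counter.incrementAndGet()  // e.g. 45630001
    }
    ```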


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Thanks @cloud-fan @steveloughran for your replies. I will look at this problem in more detail.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90827/
    Test FAILed.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90463/
    Test FAILed.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by zheh12 <gi...@git.apache.org>.
Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    I think I may not have described this issue clearly.
    
    First of all, the scenario for this problem is as follows.
    
    Multiple applications simultaneously append data to the same Parquet datasource table.
    
    They run at the same time and share the same output directory:
    
    ```
    FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
    ```
    
    `outputSpec.outputPath` is the output table directory `skip_dir/tab1/`.
    
    `skip_dir/tab1/_temporary` will be created as the temporary dir.
    
    But once one job is successfully committed, it will run `cleanupJob`:
    
    ```
    Path pendingJobAttemptsPath = getPendingJobAttemptsPath();
    
    fs.delete(pendingJobAttemptsPath, true);
    ```
    
    The `pendingJobAttemptsPath` is `skip_dir/tab1/_temporary`:
    
    ```
    private Path getPendingJobAttemptsPath() {
        return getPendingJobAttemptsPath(getOutputPath());
    }
    
    private static Path getPendingJobAttemptsPath(Path out) {
        return new Path(out, PENDING_DIR_NAME);
    }
    
    public static final String PENDING_DIR_NAME = "_temporary";
    ```
    
    After the job is committed, `skip_dir/tab1/_temporary` will be deleted. Then when other jobs attempt to commit, an error will be reported.
    
    Meanwhile, because all applications share the same app attempt id, they write temporary data to the same temporary dir `skip_dir/tab1/_temporary/0`. Data committed by the successful application is also corrupted.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    **[Test build #90463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90463/testReport)** for PR 21286 at commit [`b676a36`](https://github.com/apache/spark/commit/b676a36af110b0b7d7dfc47ab292d09c441f6a0f).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21286: [SPARK-24238][SQL] HadoopFsRelation can't append the sam...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    That would work. Like you say, no need to worry about job attempt IDs, just uniqueness. If you put the timestamp first, you could still sort the listing by time, which might be good for diagnostics.
    
    Some `org.apache.hadoop` code snippets do attempt to parse the YARN app attempt strings into numeric job & task IDs in exactly the way they shouldn't. It should already have surfaced if it was a problem in the committer codepaths, but it's worth remembering & maybe replicating in the new IDs.


---
