Posted to reviews@spark.apache.org by steveloughran <gi...@git.apache.org> on 2016/03/28 19:57:02 UTC

[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/12004

    [SPARK-7481][build][WIP] Add Hadoop 2.6+ profile to pull in object store FS accessors

    ## What changes were proposed in this pull request?
    
    [SPARK-7481] Add Hadoop 2.6+ profile to pull in object store FS accessors in hadoop-openstack, hadoop-aws and hadoop-azure. 
    
    As a result, the Hadoop s3n:// support comes back into spark-assembly; s3a and openstack support are added on Hadoop 2.6, and azure support on Hadoop 2.7. It does not add the external dependencies needed for s3a or azure:
    
    - spark-assembly has an explicit dependency on jets3t; this is used by s3n
    - s3a needs a (large) amazon-aws JAR in Hadoop 2.6; Hadoop 2.7 has switched to a leaner amazon-aws-s3 JAR.
    - azure needs a microsoft azure storage JAR
    - openstack reuses JARs already in the assembly and adds one more: commons-io. That JAR is not currently excluded from spark-assembly, though the exclusion would be easy to add.
    
    The patch defines a new module, "cloud", with transitive dependencies on the Amazon (Hadoop 2.6+) and Azure (Hadoop 2.7+) JARs. The Spark assembly JAR pulls in spark-cloud and its Hadoop dependencies (scoped as hadoop-provided), but excludes those external dependencies.
    
    Having an explicit module allows for followup work, specifically some tests. It also enables downstream applications to declare their dependency upon `spark-cloud` and get the object store accessors (and anything else people choose to add in future).
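
    As a rough sketch, a downstream sbt build might declare that dependency like this (the `spark-cloud` artifact name and the version are the ones proposed in this patch, not a published artifact):
    
    ```scala
    // Hypothetical downstream build.sbt; the artifact name and version follow this PR's proposal.
    val sparkVersion = "2.2.0-SNAPSHOT"
    
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
      // Pulls in the hadoop-aws / hadoop-azure / hadoop-openstack connectors transitively.
      "org.apache.spark" %% "spark-cloud" % sparkVersion
    )
    ```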
    
    ## How was this patch tested?
    
    The dependency logic was verified via Maven dependency checking; the inclusion of the Hadoop code and the exclusion of the com.microsoft and com.amazon files were checked by examining the contents of the assembly JAR. That check could be automated.
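
    A minimal sketch of how that assembly check could be automated (the assembly path below is a placeholder; the package prefixes are the ones discussed above):
    
    ```scala
    import java.util.jar.JarFile
    import scala.collection.JavaConverters._
    
    object AssemblyContentCheck {
      def main(args: Array[String]): Unit = {
        // Path to the built assembly JAR; the default below is only a placeholder.
        val assembly = new JarFile(args.headOption.getOrElse("assembly/target/spark-assembly.jar"))
        val entries  = assembly.entries().asScala.map(_.getName).toList
    
        // The Hadoop connector classes should be present...
        require(entries.exists(_.startsWith("org/apache/hadoop/fs/s3a/")), "hadoop-aws classes missing")
        // ...while the external SDKs should have been excluded.
        require(!entries.exists(_.startsWith("com/amazonaws/")), "amazon SDK leaked into the assembly")
        require(!entries.exists(_.startsWith("com/microsoft/")), "azure SDK leaked into the assembly")
        println("assembly contents look as expected")
      }
    }
    ```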
    
    For testing that the Spark integration with s3a, wasb, etc. actually works, I'd propose a followup piece of work. That would add some tests to spark-cloud, plus the POM changes needed to pass down the environment options for running them, skipping the tests if the credentials are not provided.
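
    For illustration only, such a follow-up test might be shaped roughly like this, cancelling itself when no credentials are supplied (the environment variable names here are assumptions, not settled names):
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.scalatest.FunSuite
    
    class S3AIntegrationSuite extends FunSuite {
    
      // Hypothetical environment variables carrying the test credentials and bucket.
      private val accessKey = sys.env.get("S3A_TEST_ACCESS_KEY")
      private val secretKey = sys.env.get("S3A_TEST_SECRET_KEY")
      private val bucket    = sys.env.get("S3A_TEST_BUCKET")
    
      test("list the root of the test bucket over s3a") {
        // Skip (rather than fail) when the credentials are not provided.
        assume(accessKey.isDefined && secretKey.isDefined && bucket.isDefined,
          "s3a test credentials not set; skipping")
    
        val conf = new Configuration()
        conf.set("fs.s3a.access.key", accessKey.get)
        conf.set("fs.s3a.secret.key", secretKey.get)
    
        val fs = FileSystem.get(new java.net.URI(s"s3a://${bucket.get}/"), conf)
        assert(fs.listStatus(new Path(s"s3a://${bucket.get}/")) != null)
      }
    }
    ```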

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark features/SPARK-7481-cloud

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12004.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12004
    
----
commit 5e9cfbe30a5aff78e5b807a2d2cf38aa1a2b814d
Author: Steve Loughran <st...@hortonworks.com>
Date:   2016-03-28T17:40:59Z

    [SPARK-7481] Add Hadoop 2.6+ profile to pull in object store FS accessors in hadoop-openstack, hadoop-aws and hadoop-azure.
    
    This defines a new module, "cloud" with transitive dependencies on the amazon (hadoop 2.6+) and azure (hadoop 2.7+) JARs. The spark assembly JAR pulls in spark-cloud and its hadoop dependencies (scoped at hadoop-provided) —but excludes those external dependencies. The hadoop classes come in (visually verified in JAR); the com.amazon and com.microsoft artifacts are omitted.
    
    Having an explicit module allows for followup work, specifically some tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62235/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    As a dumb end-user, and as the maintainer of [Flintrock](https://github.com/nchammas/flintrock), my interest in this PR stems from the hope that we will be able to get builds of Spark against the latest version of Hadoop that can interact with S3 out of the box.
    
    Because Spark builds against Hadoop 2.6 and 2.7 don't have that support, many Flintrock users [opt to use Spark built against Hadoop 2.4](https://github.com/nchammas/flintrock/issues/88) since S3 support was still bundled in with those versions. Many users don't know that they can get S3 support at runtime with the right call to `--packages`.
    
    Given that Spark and S3 are very commonly used together, I hope there is some way we can address the out-of-the-box use case here.
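
    To make the out-of-the-box case concrete, here is a hedged sketch of the end-user code this is ultimately in service of; the `--packages` coordinates in the comment are illustrative and must match the Hadoop version Spark was built against:
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    // Assumes the S3A connector was put on the classpath at launch time, e.g. with
    //   spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3 ...
    // (the version shown is illustrative; it must match the Hadoop build in use).
    val sc = new SparkContext(new SparkConf().setAppName("read-from-s3a"))
    
    // Credentials come from fs.s3a.access.key / fs.s3a.secret.key or the environment.
    val scenes = sc.textFile("s3a://landsat-pds/scene_list.gz") // public dataset also used in this PR's docs
    println(scenes.count())
    ```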




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65446/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #62255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62255/consoleFull)** for PR 12004 at commit [`c7ba2aa`](https://github.com/apache/spark/commit/c7ba2aa8bd17ffda6eb7d17465a2d1e79705770e).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #61559 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61559/consoleFull)** for PR 12004 at commit [`3f2d301`](https://github.com/apache/spark/commit/3f2d301303f9d014758b87e83b996052ba243204).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    This is the patch stripped down to the packaging, plus some tests that load the direct and indirect dependencies, thereby verifying that the classpath is valid within the module itself. It also documents the object stores and their issues, and points the existing openstack doc at the now-expanded doc.
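
    As a sketch of the kind of dependency-loading test described here (the class list is illustrative; the classes named come from the hadoop-aws, hadoop-azure and hadoop-openstack connectors and the AWS SDK, and the actual list in the patch may differ):
    
    ```scala
    import org.scalatest.FunSuite
    
    class CloudClasspathSuite extends FunSuite {
    
      // A few classes that the connectors and their transitive dependencies should make loadable.
      private val requiredClasses = Seq(
        "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "com.amazonaws.services.s3.AmazonS3Client",
        "org.apache.hadoop.fs.azure.NativeAzureFileSystem",
        "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem"
      )
    
      test("object store connector classes are on the classpath") {
        requiredClasses.foreach { name =>
          Class.forName(name) // throws ClassNotFoundException if the dependency is missing
        }
      }
    }
    ```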




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #71148 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71148/testReport)** for PR 12004 at commit [`c911ccb`](https://github.com/apache/spark/commit/c911ccb8b0d4218919fb3a6781ed5e19933fd7dc).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    If I may, I believe the intent here is to add an extra dependency-only module that adds in Hadoop's integration modules for various cloud stores. If building with this module enabled, you build in some support for S3, Azure, etc.
    
    And there are docs for the same, and some very basic smoke tests.
    
    I think the use case is a custom build of Spark for stand-alone deployment on a cloud provider, because a Hadoop cluster would already have these. I think the upside is clear: the docs are nice, and a pre-packaged way to pull in these deps correctly is nice.
    
    My outstanding hesitations are:
    
    - Well, the complexity of another module
    - Do people really want to build support for all cloud providers, or just the one they use? If just one, can they bundle it with their app? (I have the feeling I asked this and forgot the answer.)
    - Does it telegraph some commitment to working with, say, S3, that isn't really there? That is, I'm not clear that you can really use Spark with S3 after this anyway, or am I not up to date?
    - We may be about to not support Hadoop < 2.6, or < 2.7. Does that change the right way to do this?




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    I haven't forgotten this; I've just been trying to make the module POM-only while adding support for Hadoop 2.6 builds, which is causing some issues downstream. Specifically, my downstream cloud test module always seems to end up with the Hadoop 2.6 hadoop-aws module, even with different profiles enabled. As that s3a client cannot be used with Spark (it reports a block size of 1 byte, so every file gets partitioned accordingly), this is a serious problem.
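
    To make that symptom concrete, a hedged sketch of how it would surface in a job (bucket and object names are placeholders; the behaviour assumes the broken connector described above is on the classpath):
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    // With a connector that reports a tiny block size, the input-split logic produces
    // one split per block, so even a small object explodes into a huge number of partitions.
    val sc  = new SparkContext(new SparkConf().setAppName("block-size-symptom"))
    val rdd = sc.textFile("s3a://example-bucket/small-file.csv")
    println(s"partitions: ${rdd.getNumPartitions}") // unexpectedly large with the broken client
    ```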




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    comments?




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64298/
    Test FAILed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113725231
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,117 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0"
    +  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    +  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.2.0-SNAPSHOT</version>
    --- End diff --
    
    I'd noticed that this morning....




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #71725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71725/testReport)** for PR 12004 at commit [`b0bff58`](https://github.com/apache/spark/commit/b0bff5856ee8dc69f386bea944e9728d526645aa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66962/
    Test FAILed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113700119
  
    --- Diff: pom.xml ---
    @@ -1145,6 +1150,70 @@
               </exclusion>
             </exclusions>
           </dependency>
    +      <!--
    +        the AWS module pulls in jackson; its transitive dependencies can create
    +        intra-jackson-module version problems.
    +        -->
    +      <dependency>
    --- End diff --
    
    Shouldn't all this only be in `cloud/pom.xml`?




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #66513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66513/consoleFull)** for PR 12004 at commit [`a216aed`](https://github.com/apache/spark/commit/a216aed9a009c41a90131a8d6de04bb54c504a17).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89363934
  
    --- Diff: pom.xml ---
    @@ -2558,6 +2660,26 @@
           </modules>
         </profile>
     
    +    <!--
    +      The cloud profile enables the cloud module.
    +      It does not declare the hadoop-* artifacts which
    +      the cloud module pulls in; these are delegated to
    +      the hadoop-x.y protocols, so permitting different
    +      hadoop versions to declare different include/exclude
    +      rules (especially transient dependencies).
    +
    +      To use this profile, the hadoop-2.7 profile must also
    --- End diff --
    
    `hadoop-2.7` already also activates the `cloud` module ... but I see that the point is that the `cloud` profile affects the assembly too.
    
    OK, I like the flexibility, though I guess I'm on the fence about keeping it separate. I'll put it this way: should one reasonably expect that something that builds around Hadoop 2.7 includes this support by default? If it's a kinda optional extra in Hadoop, then yes, I can see how it makes sense to keep it optional here. BTW, how about `hadoop-cloud` or something, since this is narrowly about Hadoop deps?




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by mtustin-handy <gi...@git.apache.org>.
Github user mtustin-handy commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    I don't see any downsides to this. At present working with s3 isn't super painful, but I do see why one would want support to be better and smoother. 




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113696353
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores for data access.
    +through filesystem connectors implemented in Apache Hadoop or provided by third-parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be can be read or written through URLs which uses the name of the
    +object store client as the schema and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application neeeds the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any spark work —be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +That they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and, unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted in to the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path —not the number of *new* files, and that it can become a slow operation.
    --- End diff --
    
    "path -not" -> "path, not"




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-203057500
  
    **[Test build #54454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54454/consoleFull)** for PR 12004 at commit [`72b3548`](https://github.com/apache/spark/commit/72b354855867c51c6426d72209d1b98d17796730).




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113698868
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores for data access.
    +through filesystem connectors implemented in Apache Hadoop or provided by third-parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be can be read or written through URLs which uses the name of the
    +object store client as the schema and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application neeeds the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any spark work —be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +That they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and, unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted in to the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path —not the number of *new* files, and that it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a worklow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files —which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; Delete
    +directories called `"_temporary"` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet I/O Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The `filterPushdown` option
    +enables the Parquet library to optimize data reads itself, potentially saving bandwidth.
    +
    +```
    +spark.hadoop.parquet.enable.summary-metadata false
    +spark.sql.parquet.mergeSchema false
    +spark.sql.parquet.filterPushdown true
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +### ORC I/O Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +```
    +spark.sql.orc.filterPushdown true
    +spark.sql.orc.splits.include.file.footer true
    +spark.sql.orc.cache.stripe.details.size 10000
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +The `filterPushdown` option enables the ORC library to optimize data reads itself,
    +potentially saving bandwidth.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information,
    +is passed around with the file information —so eliminating the need to reread this data.
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for`s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration
    +1. Alternatively, they can be programmatically added. *Important: never put authentication
    +secrets in source code. They will be compromised*.
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a spark configuration file, do not share this file, including
    +attaching to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</a>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</a>
    +    <td>
    +    Deprected S3 client; only use for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</a>
    +    <td>
    +    Amazon's own S3 client; use only and exclusivley in Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</a>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</a>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</a>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store —it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</a>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</a>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</a>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the spark context or, in YARN clusters,
    +`core-site.xml`.
    +Versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment variables set in the environment of the user running `spark-submit`; these
    +will override any set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +is the sole S3 connector undergoing active maintenance at the Apache, and should be used wherever
    +possible.
    +
    +
    +**Tuning for performance:**
    +
    +For recent Hadoop versions, *when working with binary formats* (Parquet, ORC) use
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise random
    +```
    +
    +This reads from the object in blocks, which is efficient when seeking backwards as well as
    +forwards in a file —at the expense of making full file reads slower.
    +
    +When working with text formats (text, CSV), or any sequential read through an entire file
    +(including .gzip compressed data),
    +this "random" I/O policy should be disabled. This is the default, but can be done
    +explicitly:
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise normal
    +spark.hadoop.fs.s3a.readahead.range 157810688
    +```
    +
    +This optimizes the object read for sequential input, and when there is a forward `seek()` call
    +up to that readahead range, will simply read the data in the current HTTPS request, rather than
    +abort it and start a new one.
    +
    +
    +#### <a name="s3n"></a>S3 Native Client `s3n://`
    +
    +The ["S3N" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +was implemented in 2008 and has been widely used.
    +
    +While stable, S3N is essentially unmaintained, and deprecated in favor of S3A.
    +As well as being slower and limited in authentication mechanisms, the
    +only maintenance it receives are critical security issues.
    +
    +
    +#### <a name="emrs3"></a>Amazon EMR's S3 Client: `s3://`
    +
    +
    +In Amazon EMR, `s3://` is the URL schema used to refer to
    +[Amazon's own filesystem client](https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/),
    +one that is closed-source.
    +
    +As EMR also maps `s3n://` to the same filesystem, using URLs with the `s3n://` schema avoids
    +some confusion. Bear in mind, however, that Amazon's S3 client library is not the Apache one:
    +only Amazon can field bug reports related to it.
    +
    +To work with this data outside of EMR itself, use `s3a://` or `s3n://` instead.
    +
    +
    +#### <a name="asf_s3"></a>Obsolete: Apache Hadoop's S3 client, `s3://`
    +
    +Apache's own Hadoop releases (i.e not EMR), uses URL `s3://` to refer to a
    +deprecated inode-based filesystem implemented on top of S3.
    +This filesystem is obsolete, deprecated and has been dropped from Hadoop 3.x.
    +
    +*Important: * Do not use `s3://` URLs with Apache Spark except on Amazon EMR*
    +It is not the same as the Amazon EMR one and incompatible with all other applications.
    +
    +
    +### <a name="working_with_azure"></a>Working with Microsoft Azure Storage
    +
    +Azure support comes with the [`wasb` filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-azure/index.html).
    +
    +The Apache implementation is that used by Microsoft in Azure itself: it can be used
    +to access data in Azure as well as remotely. The object store itself is *consistent*, and
    +can be reliably used as the destination of queries.
    +
    +
    +### <a name="working_with_swift"></a>Working with OpenStack Swift
    +
    +
    +The OpenStack [`swift://` filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-openstack/index.html)
    +works with Swift object stores in private OpenStack installations, public installations
    +including Rackspace Cloud and IBM Softlayer.
    +
    +### <a name="working_with_google_cloud_storage"></a>Working with Google Cloud Storage
    +
    +[Google Cloud Storage](https://cloud.google.com/storage) is supported via Google's own
    +[GCS filesystem client](https://cloud.google.com/hadoop/google-cloud-storage-connector).
    +
    +
    +For use outside of Google cloud, `gcs-connector.jar` must be be manually downloaded then added
    +to `$SPARK_HOME/jars`.
    +
    +
    +## <a name="cloud_stores_are_not_filesystems"></a>Important: Cloud Object Stores are Not Real Filesystems
    +
    +Object stores are not filesystems: they are not a hierarchical tree of directories and files.
    +
    +The Hadoop filesystem APIs offer a filesystem API to the object stores, but underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
    +
    +While object stores can be used as the source and store
    +for persistent data, they cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS.
    +This is important to know, as the fact they are accessed with the same APIs can be misleading.
    +
    +### Directory Operations May be Slow and Non-atomic
    +
    +Directory rename and delete may be performed as a series of operations. Specifically, recursive
    --- End diff --
    
    This is all true, but is it relevant to a Spark app? maybe, if it's also using HDFS APIs directly. If you revise this bit, consider emphasizing how this reality affects Spark usage in particular.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113697394
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the schema and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop object store libraries on its classpath, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the input for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores by
    +creating a `FileInputDStream` to monitor a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the total number of files
    +under the path, not the number of *new* files, so scanning can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that retained temporary files can run up storage charges; delete
    +directories called `_temporary` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    --- End diff --
    
    Is this specific to EC2 or to any cloud?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #63787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63787/consoleFull)** for PR 12004 at commit [`2feade0`](https://github.com/apache/spark/commit/2feade078603c2bbfd5893cc1a0deb8f188cff02).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Still waiting reviews for this. Anyone? Ideally before my forthcoming Spark Summit talk...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #68668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68668/consoleFull)** for PR 12004 at commit [`ac6b33f`](https://github.com/apache/spark/commit/ac6b33f35e3e0370d33116e6defa0e9baa0ec7f1).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481] [WiP] [build] Add spark-cloud module to pul...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66182/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481] [WiP] [build] Add spark-cloud module to pul...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    @srowen anything else I need to do here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61559/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-220265675
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58856/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113697962
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the schema and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop object store libraries on its classpath, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the input for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores by
    +creating a `FileInputDStream` to monitor a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the total number of files
    +under the path, not the number of *new* files, so scanning can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that retained temporary files can run up storage charges; delete
    +directories called `_temporary` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet I/O Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The `filterPushdown` option
    +enables the Parquet library to optimize data reads itself, potentially saving bandwidth.
    +
    +```
    +spark.hadoop.parquet.enable.summary-metadata false
    +spark.sql.parquet.mergeSchema false
    +spark.sql.parquet.filterPushdown true
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +### ORC I/O Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +```
    +spark.sql.orc.filterPushdown true
    +spark.sql.orc.splits.include.file.footer true
    +spark.sql.orc.cache.stripe.details.size 10000
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +The `filterPushdown` option enables the ORC library to optimize data reads itself,
    +potentially saving bandwidth.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file information, eliminating the need to reread this data.
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration
    +1. Alternatively, they can be programmatically added. *Important: never put authentication
    +secrets in source code. They will be compromised*.
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a Spark configuration file, do not share this file, including
    +attaching it to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Deprecated S3 client; use only for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</td>
    +    <td>
    +    Amazon's own S3 client; use it only within Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</td>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</td>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</td>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store; it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</td>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</td>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</td>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the Spark context or, in YARN clusters,
    +`core-site.xml`.
    +Versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment variables set in the environment of the user running `spark-submit`; these
    +will override any set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +is the sole S3 connector undergoing active maintenance at the Apache, and should be used wherever
    --- End diff --
    
    "at the Apache Hadoop project"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-217536672
  
    For anyone trying to run these tests, they'll need a test xml file and refer to it
    
    ```
    mvn test -Phadoop-2.6 -Dcloud.test.configuration.file=../cloud.xml 
    ```
    
    The referenced file uses XInclude to input the AWS credentials which I keep a long way away from SCM-managed directories. 
    
    ```xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      ~ Licensed to the Apache Software Foundation (ASF) under one
      ~  or more contributor license agreements.  See the NOTICE file
      ~  distributed with this work for additional information
      ~  regarding copyright ownership.  The ASF licenses this file
      ~  to you under the Apache License, Version 2.0 (the
      ~  "License"); you may not use this file except in compliance
      ~  with the License.  You may obtain a copy of the License at
      ~
      ~       http://www.apache.org/licenses/LICENSE-2.0
      ~
      ~  Unless required by applicable law or agreed to in writing, software
      ~  distributed under the License is distributed on an "AS IS" BASIS,
      ~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      ~  See the License for the specific language governing permissions and
      ~  limitations under the License.
      -->
    
    <configuration>
      <include xmlns="http://www.w3.org/2001/XInclude"
        href="file:///home/stevel/.aws/keys.xml"/>
    
      <property>
        <name>aws.tests.enabled</name>
        <value>true</value>
      </property>
    
      <property>
        <name>s3a.test.uri</name>
        <value>s3a://test-eu1</value>
      </property>
    </configuration>
    ```
    
    All the test suites will be designed to run iff the relevant enabled.flag is set; this is why there's a new method to declare tests, `ctest(key: String, summary: String, detail: String)(testFun: => Unit): Unit`
    
    these tests are not only conditional on the suite being enabled, they each have a key which can be explicitly named from the build in the `test.method.keys` attr. This allows explicit methods to be named the way the current maven surefire runner doesn't; the time it can take to run individual tests makes this feature invaluable during iterative development. 
    
    ```scala
      ctest("CSVgz", "Read compressed CSV",
        "Read compressed CSV files through the spark context") {
        val source = SceneList
        sc = new SparkContext("local", "test", newSparkConf(source))
        val sceneInfo = getFS(source).getFileStatus(source)
        logInfo(s"Compressed size = ${sceneInfo.getLen}")
        validateCSV(sc, source)
        logInfo(s"Filesystem statistics ${getFS(source)}")
      }
    ```
    For best performance, you need a build of hadoop which has as much of [HADOOP-11694](https://issues.apache.org/jira/browse/HADOOP-11694) applied, especially HADOOP-12444, lazy seek (in branch-2.8 already), and [HADOOP-13028](https://issues.apache.org/jira/browse/HADOOP-13028)
    ```
    mvn test -Phadoop-2.7 -DwildcardSuites=org.apache.spark.cloud.s3.S3aIOSuite  \
    -Dcloud.test.configuration.file=../cloud.xml \
    -Dhadoop.version=2.9.0-SNAPSHOT \
    -Dtest.method.keys=CSVgz
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-204966315
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113995508
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the schema and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop object store libraries on its classpath, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
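    +
    +A minimal Scala sketch of these steps, reading the public scene list used in the example above
    +and writing derived output back; `s3a://mybucket/` is just a placeholder for a bucket you own
    +and have configured credentials for:
    +
    +```scala
    +import org.apache.spark.{SparkConf, SparkContext}
    +
    +val sc = new SparkContext(new SparkConf().setAppName("CloudExample"))
    +
    +// Read a public, compressed CSV file directly from S3 through the s3a connector.
    +val sceneList = sc.textFile("s3a://landsat-pds/scene_list.gz")
    +println(s"Scene list has ${sceneList.count()} lines")
    +
    +// Write derived data back to an object store path.
    +sceneList.filter(_.nonEmpty).saveAsTextFile("s3a://mybucket/output/scene-list-copy")
    +```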
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the input for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores by
    +creating a `FileInputDStream` to monitor a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the total number of files
    +under the path, not the number of *new* files, so scanning can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
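    +
    +A minimal sketch of such a monitoring job, with a placeholder bucket path and a deliberately
    +generous batch interval to allow for slow directory listings:
    +
    +```scala
    +import org.apache.spark.SparkConf
    +import org.apache.spark.streaming.{Seconds, StreamingContext}
    +
    +val ssc = new StreamingContext(new SparkConf().setAppName("CloudStreaming"), Seconds(60))
    +
    +// Count the lines in each batch of files that appears under the monitored path.
    +val lines = ssc.textFileStream("s3a://mybucket/incoming/")
    +lines.count().print()
    +
    +ssc.start()
    +ssc.awaitTermination()
    +```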
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that retained temporary files can run up storage charges; delete
    +directories called `_temporary` on a regular basis to avoid this.
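    +
    +The same settings can be applied programmatically when building the context; a sketch:
    +
    +```scala
    +import org.apache.spark.{SparkConf, SparkContext}
    +
    +// The committer settings listed above, set on the SparkConf directly.
    +val conf = new SparkConf()
    +  .set("spark.speculation", "false")
    +  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    +  .set("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
    +val sc = new SparkContext(conf)
    +```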
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet I/O Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The `filterPushdown` option
    +enables the Parquet library to optimize data reads itself, potentially saving bandwidth.
    +
    +```
    +spark.hadoop.parquet.enable.summary-metadata false
    +spark.sql.parquet.mergeSchema false
    +spark.sql.parquet.filterPushdown true
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +### ORC I/O Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +```
    +spark.sql.orc.filterPushdown true
    +spark.sql.orc.splits.include.file.footer true
    +spark.sql.orc.cache.stripe.details.size 10000
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +The `filterPushdown` option enables the ORC library to optimize data reads itself,
    +potentially saving bandwidth.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file information, eliminating the need to reread this data.
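    +
    +As a sketch, a subset of these options can be supplied when building a Spark session; the input
    +path here is a placeholder for a dataset of your own:
    +
    +```scala
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.builder()
    +  .appName("ColumnarFormats")
    +  .config("spark.sql.parquet.filterPushdown", "true")
    +  .config("spark.sql.parquet.mergeSchema", "false")
    +  .config("spark.sql.orc.filterPushdown", "true")
    +  .config("spark.sql.hive.metastorePartitionPruning", "true")
    +  .getOrCreate()
    +
    +// Read a Parquet dataset directly from an object store path.
    +val df = spark.read.parquet("s3a://mybucket/data/parquet-table")
    +println(s"rows: ${df.count()}")
    +```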
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration
    +1. Alternatively, they can be programmatically added. *Important: never put authentication
    +secrets in source code. They will be compromised*.
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a Spark configuration file, do not share this file, including
    +attaching it to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
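    +
    +As an illustration of the programmatic route, the sketch below copies the `AWS_ACCESS_KEY` and
    +`AWS_SECRET_KEY` environment variables mentioned above into the S3A option names used by the
    +Hadoop S3A client, rather than hard-coding any secret in source code:
    +
    +```scala
    +import org.apache.spark.SparkConf
    +
    +val conf = new SparkConf()
    +// Only set the options when the environment variables are present.
    +sys.env.get("AWS_ACCESS_KEY").foreach(conf.set("spark.hadoop.fs.s3a.access.key", _))
    +sys.env.get("AWS_SECRET_KEY").foreach(conf.set("spark.hadoop.fs.s3a.secret.key", _))
    +```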
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Deprecated S3 client; use only for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</td>
    +    <td>
    +    Amazon's own S3 client; use it only within Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</td>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</td>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</td>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store; it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</td>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</td>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</td>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the Spark context or, in YARN clusters,
    +`core-site.xml`.
    +Versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment variables set in the environment of the user running `spark-submit`; these
    +will override any set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +is the sole S3 connector undergoing active maintenance in the Apache Hadoop project, and should be used wherever
    +possible.
    +
    +
    +**Tuning for performance:**
    +
    +For recent Hadoop versions, *when working with binary formats* (Parquet, ORC) use
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise random
    +```
    +
    +This reads from the object in blocks, which is efficient when seeking backwards as well as
    +forwards in a file, at the expense of making full file reads slower.
    +
    +When working with text formats (text, CSV), or any sequential read through an entire file
    +(including .gzip compressed data),
    +this "random" I/O policy should be disabled. This is the default, but it can be set
    +explicitly:
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise normal
    +spark.hadoop.fs.s3a.readahead.range 157810688
    +```
    +
    +This optimizes the object read for sequential input; when there is a forward `seek()` call
    +within the readahead range, the client simply reads ahead in the current HTTPS request, rather than
    +aborting it and starting a new one.
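    +
    +A sketch of choosing the policy per job, using the property shown above:
    +
    +```scala
    +import org.apache.spark.SparkConf
    +
    +// "random" suits columnar formats (Parquet, ORC); "normal", the default, suits
    +// full sequential reads such as text, CSV and .gzip data.
    +val conf = new SparkConf()
    +  .set("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    +```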
    +
    +
    +#### <a name="s3n"></a>S3 Native Client `s3n://`
    +
    +The ["S3N" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +was implemented in 2008 and has been widely used.
    +
    +While stable, S3N is essentially unmaintained, and deprecated in favor of S3A.
    +As well as being slower and more limited in authentication mechanisms, the
    +only maintenance it receives is for critical security issues.
    +
    +
    +#### <a name="emrs3"></a>Amazon EMR's S3 Client: `s3://`
    +
    +
    +In Amazon EMR, `s3://` is the URL schema used to refer to
    +[Amazon's own filesystem client](https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/),
    +one that is closed-source.
    +
    +As EMR also maps `s3n://` to the same filesystem, using URLs with the `s3n://` schema avoids
    +some confusion. Bear in mind, however, that Amazon's S3 client library is not the Apache one:
    +only Amazon can field bug reports related to it.
    +
    +To work with this data outside of EMR itself, use `s3a://` or `s3n://` instead.
    +
    +
    +#### <a name="asf_s3"></a>Obsolete: Apache Hadoop's S3 client, `s3://`
    +
    +Apache's own Hadoop releases (i.e. not EMR) use the URL `s3://` to refer to a
    +deprecated inode-based filesystem implemented on top of S3.
    +This filesystem is obsolete and has been dropped from Hadoop 3.x.
    +
    +*Important:* Do not use `s3://` URLs with Apache Spark except on Amazon EMR.
    +It is not the same as the Amazon EMR client, and it is incompatible with all other applications.
    +
    +
    +### <a name="working_with_azure"></a>Working with Microsoft Azure Storage
    +
    +Azure support comes with the [`wasb` filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-azure/index.html).
    +
    +The Apache implementation is the one used by Microsoft within Azure itself: it can be used
    +to access data both from inside Azure and remotely. The object store itself is *consistent*, and
    +can be reliably used as the destination of queries.
    +
    +
    +### <a name="working_with_swift"></a>Working with OpenStack Swift
    +
    +
    +The OpenStack [`swift://` filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-openstack/index.html)
    +works with Swift object stores in private OpenStack installations and in public installations
    +such as Rackspace Cloud and IBM Softlayer.
    +
    +### <a name="working_with_google_cloud_storage"></a>Working with Google Cloud Storage
    +
    +[Google Cloud Storage](https://cloud.google.com/storage) is supported via Google's own
    +[GCS filesystem client](https://cloud.google.com/hadoop/google-cloud-storage-connector).
    +
    +
    +For use outside of Google Cloud, `gcs-connector.jar` must be manually downloaded and then added
    +to `$SPARK_HOME/jars`.
    +
    +
    +## <a name="cloud_stores_are_not_filesystems"></a>Important: Cloud Object Stores are Not Real Filesystems
    +
    +Object stores are not filesystems: they are not a hierarchical tree of directories and files.
    +
    +The Hadoop filesystem connectors offer a filesystem API on top of the object stores, but underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html).
    +
    +While object stores can be used as the source and store
    +for persistent data, they cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS.
    +This is important to know, as the fact that they are accessed through the same APIs can be misleading.
    +
    +### Directory Operations May be Slow and Non-atomic
    +
    +Directory rename and delete may be performed as a series of operations. Specifically, recursive
    +directory deletion may be implemented as "list the objects, delete them singly or in batches".
    +File and directory renames may be implemented as "copy all the objects" followed by the delete operation.
    +
    +1. The time to delete a directory depends on the number of files in the directory.
    +1. Directory deletion may fail partway through, leaving a partially deleted directory.
    +1. Directory renaming may fail partway through, leaving the destination directory containing some of the files
    +being renamed and the source directory untouched.
    +1. The time to rename files and directories increases with the amount of data to rename.
    +1. If the rename is done on the client, the time to rename
    +each file will depend upon the bandwidth between the client and the object store. The further away the client
    +is, the longer the rename will take.
    +1. Recursive directory listing can be very slow. This can slow down some parts of job submission
    +and execution.
    +
    +Because of these behaviours, committing work by renaming directories is neither efficient nor
    +reliable. In Spark 1.6 and its predecessors, there was a special output committer for Parquet,
    +`org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter`,
    +which bypassed the rename phase. However, as well as having major problems when used
    +with speculative execution enabled, it handled failures badly. For this reason, it
    +[was removed from Spark 2.0](https://issues.apache.org/jira/browse/SPARK-10063).
    +
    +*Critical*: speculative execution does not work safely against object
    +stores which do not support atomic directory renames. Your output may be
    +corrupted.
    +
    +*Warning*: even non-speculative execution is at risk of leaving the output of a job in an inconsistent
    +state if a "direct" output committer is used and executors fail.
    +
    +### Data is Not Written Until the OutputStream's `close()` Operation.
    +
    +Data written to the object store is often buffered to a local file or stored in memory,
    --- End diff --
    
    (FWIW, where this does cause problems is that really slow writes can block things like heartbeating protocols if the same thread is doing the heartbeat. It hasn't surfaced in Spark AFAIK.)




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    The latest patch
    
    1.  Keeps the cloud package separate from hadoop-2.7. This is important to avoid outstanding problems related to org.json-licensed artifacts in the AWS SDK JARs. The Hadoop project retains the right to release these binaries until April 2017, but other projects are not allowed to start doing so. For builds against Hadoop 2.9 or later, upgraded dependencies mean the SDK is distributable.
    1. Declares the dependency on `com.fasterxml.jackson.dataformat:jackson-dataformat-cbor` needed to keep a dependency in the updated AWS SDK consistent with the rest of Spark's Jackson imports.
    1. Adds the SBT build too. 
    
    I'm about to flip the name to spark-hadoop-cloud





[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    this patch is ready for review. Anyone?




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113971968
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries on its classpath, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
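    +
    +For SBT builds, an equivalent declaration is sketched below; `sparkVersion` is a placeholder
    +for whichever Spark release is in use, and `%%` appends the Scala binary version suffix:
    +
    +```scala
    +// build.sbt fragment (illustrative sketch only)
    +val sparkVersion = "2.2.0"  // placeholder: match the Spark version of your cluster
    +libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion
    +```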
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data and
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
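    +
    +As an illustration, here is a minimal Scala sketch of reading and writing object store data.
    +The output bucket and the filter are hypothetical; the input path is the public Landsat
    +scene list used as an example above:
    +
    +```scala
    +import org.apache.spark.{SparkConf, SparkContext}
    +
    +val sc = new SparkContext(new SparkConf().setAppName("object-store-io"))
    +
    +// Read a public dataset through the s3a:// connector, exactly as if it were a filesystem path.
    +val sceneList = sc.textFile("s3a://landsat-pds/scene_list.gz")
    +
    +// Do some work, then write the results back to a (hypothetical) bucket you own.
    +sceneList.filter(_.contains("LC8")).saveAsTextFile("s3a://example-bucket/output/scenes")
    +```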
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the total number of files
    +under the path, not the number of *new* files, so scanning can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
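    +
    +For illustration, a minimal Scala sketch of monitoring a (hypothetical) object store path.
    +The batch interval is deliberately generous because listing the path can be slow:
    +
    +```scala
    +import org.apache.spark.SparkConf
    +import org.apache.spark.streaming.{Seconds, StreamingContext}
    +
    +val conf = new SparkConf().setAppName("object-store-stream")
    +val ssc = new StreamingContext(conf, Seconds(60))
    +
    +// Each batch picks up files that have appeared under the monitored prefix.
    +val lines = ssc.textFileStream("s3a://example-bucket/incoming/")
    +lines.count().print()
    +
    +ssc.start()
    +ssc.awaitTermination()
    +```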
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; delete
    --- End diff --
    
    It's essentially: "should the job fail just because we couldn't delete stuff under _temporary?"
    
    If you set it to false, the exceptions get logged and swallowed.
    Set it to true, and yes, the job fails. But then what? That's failed the job, but it's not going to make the data go away, is it? Instead the job has failed and someone is left trying to understand what went wrong and to recover. Which, given it's a codepath which rarely occurs (at least with filesystems), isn't something the code does much of.
    
    IMO: better cleanup logic here is needed. Like some retries. But really, given the more fundamental flaws with committing to blobstores, this is not something worth touching in FileOutputCommitter. (I should add: I've not personally hit this problem)




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113976861
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries on its classpath, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the total number of files
    +under the path, not the number of *new* files, so scanning can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; delete
    +directories called `_temporary` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet I/O Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +extra read and write operations must be minimized: this means disabling the generation of
    +summary metadata and the merging of schemas from multiple files. The `filterPushdown` option
    +enables the Parquet library to optimize data reads itself, potentially saving bandwidth.
    +
    +```
    +spark.hadoop.parquet.enable.summary-metadata false
    +spark.sql.parquet.mergeSchema false
    +spark.sql.parquet.filterPushdown true
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +### ORC I/O Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +```
    +spark.sql.orc.filterPushdown true
    +spark.sql.orc.splits.include.file.footer true
    +spark.sql.orc.cache.stripe.details.size 10000
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +The `filterPushdown` option enables the ORC library to optimize data reads itself,
    +potentially saving bandwidth.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file split information, eliminating the need to reread this data.
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration.
    +1. Alternatively, they can be set programmatically, as in the sketch below. *Important: never put
    +authentication secrets in source code. They will be compromised.*
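    +
    +A minimal Scala sketch of the programmatic route, copying credentials from environment
    +variables (the variable names follow those mentioned above) into the Hadoop configuration
    +used by the Spark context, rather than hard-coding them:
    +
    +```scala
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.builder().appName("s3a-auth").getOrCreate()
    +val hadoopConf = spark.sparkContext.hadoopConfiguration
    +
    +// fs.s3a.access.key / fs.s3a.secret.key are the S3A credential properties.
    +for (key <- sys.env.get("AWS_ACCESS_KEY")) {
    +  hadoopConf.set("fs.s3a.access.key", key)
    +}
    +for (secret <- sys.env.get("AWS_SECRET_KEY")) {
    +  hadoopConf.set("fs.s3a.secret.key", secret)
    +}
    +```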
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a Spark configuration file, do not share this file, including
    +attaching it to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Deprecated S3 client; only use it with Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</td>
    +    <td>
    +    Amazon's own S3 client; use it only in Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</td>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</td>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</td>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store; it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</td>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</td>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</td>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the Spark context or, in YARN clusters,
    +`core-site.xml`.
    +Versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment variables set in the environment of the user running `spark-submit`; these
    +will override any set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +is the sole S3 connector under active maintenance at the Apache Software Foundation, and should
    +be used wherever possible.
    +
    +
    +**Tuning for performance:**
    +
    +For recent Hadoop versions, *when working with binary formats* (Parquet, ORC) use
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise random
    +```
    +
    +This reads from the object in blocks, which is efficient when seeking backwards as well as
    +forwards in a file, at the expense of making full file reads slower.
    --- End diff --
    
    again, cut that whole section. Left to the Hadoop docs




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-211546332
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r114056158
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries on its classpath, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the total number of files
    +under the path, not the number of *new* files, so scanning can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; delete
    +directories called `_temporary` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    --- End diff --
    
    Followup:
    * Hive now [treats localhost as "anywhere"](https://issues.apache.org/jira/browse/HIVE-14060)
    * As [does Tez](https://issues.apache.org/jira/browse/TEZ-3291)
    
    That's a recent change in both projects; someone would need to test the Spark placement code to see what decisions it makes here. The YARN scheduling setting is essentially a workaround.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #63984 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63984/consoleFull)** for PR 12004 at commit [`0d9f122`](https://github.com/apache/spark/commit/0d9f12250dd1b9f78acdac714b0bbeeda294cef5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214794580
  
    **[Test build #57003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57003/consoleFull)** for PR 12004 at commit [`4e4e941`](https://github.com/apache/spark/commit/4e4e9419179a218a2c0e9df9a58ce649512b5e3a).




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113694234
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,158 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.2.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project Cloud Integration</name>
    +  <description>
    +    Contains support for cloud infrastructures, specifically the Hadoop JARs and
    +    transitive dependencies needed to interact with the infrastructures.
    +
    +    Any project which explicitly depends upon the spark-hadoop-cloud artifact will get the
    +    dependencies; the exact versions of which will depend upon the hadoop version Spark was compiled
    +    against.
    +
    +    The imports of transitive dependencies are managed to make them consistent
    +    with those of the Spark build.
    +
    +    WARNING: the signatures of methods in the AWS and Azure SDKs do change between
    --- End diff --
    
    I would only include the first sentence here. The description here should be short since nobody will likely read it. Anything substantive could go in docs.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    @steveloughran can you clarify what this does? It seems like just 5000 lines of examples and test cases? Users can already use these cloud stores by just adding the proper dependencies, can't they?




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-221668697
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59287/
    Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-204941193
  
    **[Test build #54803 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54803/consoleFull)** for PR 12004 at commit [`6487d93`](https://github.com/apache/spark/commit/6487d93ea76420b67510360af0093cabcd9860d3).




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214830407
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-212598139
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    I still don't think this answered my last questions? yes, I understand all this back story. That's why this is taking such a large amount of everyone's time. The purpose and discussion and commits keep shifting significantly so I have to re-read this from the start every time to see what it's done this time. I still only partly perceive the problem and why it takes this much to solve it.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    I have the impression that you can't really use Spark with S3 and only S3, not as an intermediate store, because it's too eventually-consistent. Does the presence of additional integration libraries alone change that, or am I mistaken? that is, I'm wondering whether this really does what it appears to say on the tin, which is to make Spark usable with just S3.
    
    My other question was indeed whether we need a different module if we're just about to only support 2.7, or 2.6/2.7. That's more of a detail of implementation. Does a build of Spark + Hadoop 2.7 right now have no ability at all to read from S3 out of the box, or just not full / ideal support?
    
    Finally, is the Spark build the best thing to provide these dependencies? well, it provides the core Hadoop FS support already, so yes. But on the other hand, Hadoop is farming this out as optional itself, it can be added by a user app (right?), it would already be present if running in the presence of a cluster.
    
    It looks like there's a little more to it than just adding one new dependency, but not much more. I'm working out just how much the module is needed for users to get it right, vs complexity. 
    
    The docs are valuable.





[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    That's it warning that the manifest has changed. Which it has: there are now hadoop-azure, hadoop-openstack and hadoop-aws JARs on the classpath, along with their dependencies (the Amazon AWS SDK and the Microsoft Azure storage library), all of which go into `SPARK_HOME/jars` if the cloud profile is enabled in the build.
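
For readers following the build side of this, a minimal sketch of enabling that, assuming the profile keeps the `cloud` name used in this patch and is combined with the `hadoop-2.7` profile as the pom comments require:

```
# Sketch only: builds Spark with the cloud module so hadoop-aws, hadoop-azure and
# hadoop-openstack, plus their SDK dependencies, end up in SPARK_HOME/jars.
./build/mvn -Phadoop-2.7 -Pcloud -DskipTests clean package
```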




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214785349
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56998/
    Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Downgrading to a WIP: to work reliably this needs [HADOOP-12636](https://issues.apache.org/jira/browse/HADOOP-12636) on the Hadoop side, otherwise the presence of `hadoop-aws.jar` on the classpath without the Amazon SDK JARs breaks FileSystem startup. That fix will be in the next Hadoop 2.7 release as well as 2.8+. Once the 2.7.3 release is out, this module can be set up to be included for 2.7+ build profiles without problems. S3A isn't suitable for production use in Hadoop 2.6, so leaving it out of that profile shouldn't be a problem.
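
To illustrate the coupling described above (a sketch only; both versions below are placeholders): an application or build that puts `hadoop-aws` on the classpath must also provide the matching AWS SDK artifact, otherwise the presence of `hadoop-aws.jar` without the SDK breaks `FileSystem` startup as described, until HADOOP-12636 is in the Hadoop release being used.

```xml
<!-- Illustrative only: ${hadoop.version} and ${aws.sdk.version} are placeholders;
     the SDK version must be the one hadoop-aws was built against for that
     Hadoop release. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
</dependency>
<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-s3</artifactId>
  <version>${aws.sdk.version}</version>
</dependency>
```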




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #63787 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63787/consoleFull)** for PR 12004 at commit [`2feade0`](https://github.com/apache/spark/commit/2feade078603c2bbfd5893cc1a0deb8f188cff02).




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214785346
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113700008
  
    --- Diff: pom.xml ---
    @@ -621,6 +621,11 @@
             <version>${fasterxml.jackson.version}</version>
           </dependency>
           <dependency>
    +        <groupId>com.fasterxml.jackson.dataformat</groupId>
    +        <artifactId>jackson-dataformat-cbor</artifactId>
    --- End diff --
    
    what drives this -- is it just another Jackson component whose version has to be harmonized?




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113976699
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries; this can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency; this can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    --- End diff --
    
    done




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-220265664
  
    **[Test build #58856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58856/consoleFull)** for PR 12004 at commit [`4e37c7a`](https://github.com/apache/spark/commit/4e37c7a0e5509fee74f7af519dd81aa97de60762).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65396/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64487/consoleFull)** for PR 12004 at commit [`9f1bf6b`](https://github.com/apache/spark/commit/9f1bf6b55daff23b7dc356b30582a9b0594834bf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113696235
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries; this can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency; this can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    --- End diff --
    
    I sort of know what this is about, but is it really a problem to use, say, S3 as the destination of a job's results? It seems like that's a case that's relatively fine. It's using it for intermediate storage where the eventual consistency could be a problem.
    
    I guess, generally, the object stores are more prone to errors. I'm just wondering how actionable this is -- can we really say, here's how to do X but X doesn't really work, so work around it?
    
    Is it really any more reliable to distcp -- why would that be more reliable w.r.t. S3?
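
For reference, the workflow the document recommends is roughly the following sketch (namenode address and bucket name are placeholders): the job commits its output to the cluster filesystem, and the final results are copied out to the object store as a separate step.

```
# Sketch: results are written to HDFS by the job itself, then copied to S3
# afterwards in a single bulk operation.
hadoop distcp hdfs://namenode:8020/results s3a://example-bucket/results
```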




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71148/
    Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214747671
  
    The latest version of this does, among other things, call FileSystem.toString after operations. In HADOOP-13028, along with seek optimisation, S3AFileSystem.toString() now dumps all the statistics to date. This means that the aggregate state of all test runs is displayed; if you run a specific test standalone you can see the stats purely for that test.
    
    Here's a test with the maven args `-Phadoop-2.7 -DwildcardSuites=org.apache.spark.cloud.s3.S3aIOSuite -Dcloud.test.configuration.file=../cloud.xml -Dhadoop.version=2.9.0-SNAPSHOT -Dtest.method.keys=CSVgz` 
    ```
    2016-04-26 14:32:17,104 INFO  scheduler.TaskSetManager (Logging.scala:logInfo(54)) - Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 5261 bytes)
    2016-04-26 14:32:17,105 INFO  executor.Executor (Logging.scala:logInfo(54)) - Running task 0.0 in stage 0.0 (TID 0)
    2016-04-26 14:32:17,111 INFO  rdd.HadoopRDD (Logging.scala:logInfo(54)) - Input split: s3a://landsat-pds/scene_list.gz:0+20430493
    2016-04-26 14:32:17,285 INFO  compress.CodecPool (CodecPool.java:getDecompressor(181)) - Got brand-new decompressor [.gz]
    2016-04-26 14:32:21,724 INFO  executor.Executor (Logging.scala:logInfo(54)) - Finished task 0.0 in stage 0.0 (TID 0). 2643 bytes result sent to driver
    2016-04-26 14:32:21,727 INFO  scheduler.TaskSetManager (Logging.scala:logInfo(54)) - Finished task 0.0 in stage 0.0 (TID 0) in 4625 ms on localhost (1/1)
    2016-04-26 14:32:21,727 INFO  scheduler.TaskSchedulerImpl (Logging.scala:logInfo(54)) - Removed TaskSet 0.0, whose tasks have all completed, from pool 
    2016-04-26 14:32:21,728 INFO  scheduler.DAGScheduler (Logging.scala:logInfo(54)) - ResultStage 0 (count at S3aIOSuite.scala:127) finished in 4.626 s
    2016-04-26 14:32:21,728 INFO  scheduler.DAGScheduler (Logging.scala:logInfo(54)) - Job 0 finished: count at S3aIOSuite.scala:127, took 4.636417 s
    2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) -  size of s3a://landsat-pds/scene_list.gz = 464105 rows read in 4815885000 nS
    2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) - Filesystem statistics S3AFileSystem{uri=s3a://landsat-pds, workingDir=s3a://landsat-pds/user/stevel, partSize=104857600, enableMultiObjectsDelete=true, multiPartThreshold=2147483647, statistics {40864879 bytes read, 7786 bytes written, 110 read ops, 0 large read ops, 26 write ops}, metrics {{Context=S3AFileSystem} {FileSystemId=bc5db77d-e17d-41bb-88ab-44b26cf3eda4-landsat-pds} {fsURI=s3a://landsat-pds/scene_list.gz} {files_created=0} {files_copied=0} {files_copied_bytes=0} {files_deleted=0} {directories_created=0} {directories_deleted=0} {ignored_errors=0} {streamForwardSeekOperations=0} {streamCloseOperations=2} {streamBytesSkippedOnSeek=0} {streamReadOperations=2821} {streamReadExceptions=0} {streamAborted=0} {streamBackwardSeekOperations=0} {streamClosed=2} {streamOpened=2} {streamSeekOperations=0} {streamBytesRead=40860986} {streamReadOperationsIncomplete=2821} {streamReadFullyOperations=0} }}
    2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) - 
    ```





[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113698062
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries; this can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency; this can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; delete
    +directories called `"_temporary"` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet I/O Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The `filterPushdown` option
    +enables the Parquet library to optimize data reads itself, potentially saving bandwidth.
    +
    +```
    +spark.hadoop.parquet.enable.summary-metadata false
    +spark.sql.parquet.mergeSchema false
    +spark.sql.parquet.filterPushdown true
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +### ORC I/O Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +```
    +spark.sql.orc.filterPushdown true
    +spark.sql.orc.splits.include.file.footer true
    +spark.sql.orc.cache.stripe.details.size 10000
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +The `filterPushdown` option enables the ORC library to optimize data reads itself,
    +potentially saving bandwidth.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file information, eliminating the need to reread this data.
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration
    +1. Alternatively, they can be programmatically added. *Important: never put authentication
    +secrets in source code. They will be compromised*.
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a spark configuration file, do not share this file, including
    +attaching to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Deprecated S3 client; only use for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</td>
    +    <td>
    +    Amazon's own S3 client; use only and exclusively in Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</td>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</td>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</td>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store; it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</td>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</td>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</td>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the spark context or, in YARN clusters,
    +`core-site.xml`.
    +Versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment variables set in the environment of the user running `spark-submit`; these
    +will override any set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +is the sole S3 connector undergoing active maintenance at the ASF, and should be used wherever
    +possible.
    +
    +
    +**Tuning for performance:**
    +
    +For recent Hadoop versions, *when working with binary formats* (Parquet, ORC) use
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise random
    +```
    +
    +This reads from the object in blocks, which is efficient when seeking backwards as well as
    +forwards in a file, at the expense of making full file reads slower.
    --- End diff --
    
    same comment about the dash here




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #65396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65396/consoleFull)** for PR 12004 at commit [`ca3163d`](https://github.com/apache/spark/commit/ca3163ddc18c18cb626e50ab5ba1a650a2d5c8ea).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64206/consoleFull)** for PR 12004 at commit [`63cf84f`](https://github.com/apache/spark/commit/63cf84f17d79813404b03c259a52bccb2dcb5853).




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89132623
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,158 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.1.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project Cloud Integration</name>
    +  <description>
    +    Contains support for cloud infrastructures, specifically the Hadoop JARs and
    +    transitive dependencies needed to interact with the infrastructures.
    +
    +    Any project which explicitly depends upon the spark-cloud artifact will get the dependencies;
    +    the exact versions of which will depend upon the hadoop version Spark was compiled against.
    +
    +    Hadoop 2.7:
    +      hadoop-aws
    +      aws-java-sdk-s3
    +      hadoop-azure
    +      azure-storage
    +      hadoop-openstack
    +
    +    WARNING: the signatures of methods in aws-java-sdk/aws-java-sdk-s3 can change between versions:
    +    use the same version against which Hadoop was compiled.
    +
    +  </description>
    +  <properties>
    +    <sbt.project.name>cloud</sbt.project.name>
    +  </properties>
    +
    +  <dependencies>
    +    <dependency>
    +      <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-core_${scala.binary.version}</artifactId>
    +      <version>${project.version}</version>
    +    </dependency>
    +
    +    <!--Used for test classes -->
    +    <dependency>
    +      <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-core_${scala.binary.version}</artifactId>
    +      <version>${project.version}</version>
    +      <type>test-jar</type>
    +      <scope>test</scope>
    +    </dependency>
    +
    +
    +    <!-- Jets3t is needed for s3n and s3 classic to work-->
    +    <dependency>
    +      <groupId>net.java.dev.jets3t</groupId>
    +      <artifactId>jets3t</artifactId>
    +    </dependency>
    +
    +    <!-- Explicit listing of transitive deps that are shaded. Otherwise, odd compiler crashes. -->
    +    <dependency>
    +      <groupId>com.google.guava</groupId>
    +      <artifactId>guava</artifactId>
    +    </dependency>
    +    <!-- End of shaded deps. -->
    +  </dependencies>
    +
    +  <build>
    +    <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
    +    <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
    +  </build>
    +
    +  <profiles>
    +
    +    <!--
    +      This profile is enabled automatically by the sbt build. It changes the scope for the guava
    +      dependency, since we don't shade it in the artifacts generated by the sbt build.
    +    -->
    +    <profile>
    +      <id>sbt</id>
    +      <dependencies>
    +        <dependency>
    +          <groupId>com.google.guava</groupId>
    +          <artifactId>guava</artifactId>
    +          <scope>compile</scope>
    +        </dependency>
    +      </dependencies>
    +    </profile>
    +
    +    <profile>
    +      <id>hadoop-2.7</id>
    +        <dependencies>
    +          <dependency>
    +            <groupId>org.apache.hadoop</groupId>
    +            <artifactId>hadoop-aws</artifactId>
    --- End diff --
    
    OK, so the idea here is that these dependencies used to be baked in to other artifacts in Hadoop before 2.7? or weren't available at all before?
    
    I am guessing there is no license issue here, coming from Hadoop.





[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89134807
  
    --- Diff: pom.xml ---
    @@ -2558,6 +2660,26 @@
           </modules>
         </profile>
     
    +    <!--
    +      The cloud profile enables the cloud module.
    +      It does not declare the hadoop-* artifacts which
    +      the cloud module pulls in; these are delegated to
    +      the hadoop-x.y profiles, so permitting different
    +      hadoop versions to declare different include/exclude
    +      rules (especially for transitive dependencies).
    +
    +      To use this profile, the hadoop-2.7 profile must also
    --- End diff --
    
    hadoop-2.7 already adds this module. Is this redundant?




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89346124
  
    --- Diff: cloud/src/test/scala/org/apache/spark/cloud/AzureInstantiationSuite.scala ---
    @@ -0,0 +1,29 @@
    +/*
    --- End diff --
    
    In the absence of any real tests, these do check that the transitive dependencies are picked up and that, at a surface level, there aren't fundamental differences between Jackson versions.

    I have a copy of the same tests in my downstream integration tests, where they are the first stop for detecting problems; the advantage is that you don't need any credentials to run them, and they validate the spark-cloud exports rather than the spark-cloud test classpath. I'll cut them from here if you want.
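
For context, a hypothetical sketch of what such an instantiation check looks like (this is not the code in the patch; the connector class is the one shipped in hadoop-azure):

```scala
import org.scalatest.FunSuite

// Sketch of a credential-free classpath check: constructing the connector class
// is intended to surface linkage errors if hadoop-azure or its transitive
// dependencies (e.g. azure-storage) are missing from the classpath.
class AzureInstantiationSketch extends FunSuite {
  test("hadoop-azure and its dependencies are on the classpath") {
    new org.apache.hadoop.fs.azure.NativeAzureFileSystem()
  }
}
```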




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113697831
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; delete
    +directories called `"_temporary"` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet I/O Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The `filterPushdown` option
    +enables the Parquet library to optimize data reads itself, potentially saving bandwidth.
    +
    +```
    +spark.hadoop.parquet.enable.summary-metadata false
    +spark.sql.parquet.mergeSchema false
    +spark.sql.parquet.filterPushdown true
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +### ORC I/O Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +```
    +spark.sql.orc.filterPushdown true
    +spark.sql.orc.splits.include.file.footer true
    +spark.sql.orc.cache.stripe.details.size 10000
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +The `filterPushdown` option enables the ORC library to optimize data reads itself,
    +potentially saving bandwidth.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file information, eliminating the need to reread this data.
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration
    +1. Alternatively, they can be programmatically added. *Important: never put authentication
    +secrets in source code. They will be compromised*.
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a spark configuration file, do not share this file, including
    +attaching to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Deprecated S3 client; only use for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</td>
    +    <td>
    +    Amazon's own S3 client; use it only within Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</td>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</td>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</td>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store \u2014it is also the one
    --- End diff --
    
    "store -it" -> "store. It"




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113970275
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    --- End diff --
    
    That `FileOutputCommitter` code is a mess: I've stepped through it repeatedly and never quite worked out what it does. As usual, the design comes down to big Y! queries and things that went wrong. A big part of the design is handling the failure and restart of entire MR jobs, trying to recover all the data already generated and committed by the first attempt.
    
    These were MR jobs with many workers taking hours, so the probability of failure was high. The faster you can get the work done, or the fewer executors you need, the less frequent failures are. That whole scenario of a new app instance trying to recover the incomplete work of a previous instance doesn't exist any more. (What may still exist, though, is an old app instance still running due to some network partition.)




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #66505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66505/consoleFull)** for PR 12004 at commit [`e983fa6`](https://github.com/apache/spark/commit/e983fa6fce643d1965982e381cfe2fb5a288b819).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68668/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [WiP] [build] Add spark-cloud module to pul...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66504/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #68936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68936/consoleFull)** for PR 12004 at commit [`9726d6c`](https://github.com/apache/spark/commit/9726d6c857439b870e08d49889a5da3c35a708e3).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    I do think it would be better to consider, first, just the module and doc bit. What do you think @rxin et al?
    
    No, I may be arguing against something nobody is suggesting. This here is entirely fine, of course. It sounded like the topic was feature branches in the main git repo; that's not what this is. A long-running 'WIP' branch runs some of the same risk of becoming too big before anyone is asked to look at it.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-211487887
  
    **[Test build #56092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56092/consoleFull)** for PR 12004 at commit [`ce48b8c`](https://github.com/apache/spark/commit/ce48b8c377446e210c552d70ce4a391f4a4db4e6).




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113697019
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    --- End diff --
    
    What is the consequence of setting or not setting these things? Does it fail if speculation or v1 is used? I think that's the kind of info that is actionable for a user who needs to decide what to configure.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113695445
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any spark work \u2014be it batch, SQL, DataFrame,
    --- End diff --
    
    "spark work -be" -> "Spark work, be"




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113695246
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include <code>hadoop-openstack</code> dependency, which can
    --- End diff --
    
    back-tick rather than `<code>`? it's trivial




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    GitHub isn't letting me reopen this, so I'm going to submit the patch with reworked docs as a new PR. The machines do not like me today.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113969133
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    --- End diff --
    
    The v1 commit algorithm is just way slower, with more renaming.
    Speculation is potentially dangerous because of the way rename() is used for some atomic parts of the commit. It's why the Direct Committer always turned off speculation: to avoid more than one executor trying to write to the same destination directory.
    
    That said, if it is up to the job committer to do the final commit action, then speculation isn't so risky. After all, the commit algorithm needs to be able to handle executor failure anyway, and speculation is similar but actually easier, as there's a commit protocol to help abort work.
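    As a rough sketch of the settings under discussion (the property names are the ones from the doc text above; whether v2 plus speculation is safe for a given object store is exactly the open question):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Sketch only: disable speculation and select the v2 commit algorithm
    // before the SparkContext/SparkSession is created.
    val conf = new SparkConf()
      .set("spark.speculation", "false")
      .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    ```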




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89365091
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,953 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries on its classpath, which can
    +be done by including the `spark-cloud` module for the specific version of Spark used.
    +
    +This module pulls in the <code>hadoop-openstack</code> dependency along with the other
    +object store connector JARs.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
    +
    +### Basic Use
    +
    +
    +
    +To refer to a path in Amazon S3, use `s3a://` as the scheme (Hadoop 2.7+) or `s3n://` on older versions.
    +
    +{% highlight scala %}
    +sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()
    +{% endhighlight %}
    +
    +Similarly, an RDD can be saved to an object store via `saveAsTextFile()`
    +
    +
    +{% highlight scala %}
    +val numbers = sparkContext.parallelize(1 to 1000)
    +
    +// save to Amazon S3 (or compatible implementation)
    +numbers.saveAsTextFile("s3a://testbucket/counts")
    +
    +// Save to Azure Object store
    +numbers.saveAsTextFile("wasb://testbucket@example.blob.core.windows.net/counts")
    +
    +// save to an OpenStack Swift implementation
    +numbers.saveAsTextFile("swift://testbucket.openstack1/counts")
    +{% endhighlight %}
    +
    +That's essentially it: object stores can act as a source and destination of data, using exactly
    +the same APIs to load and save data as one uses to work with data in HDFS or other filesystems.
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +1. Have the JAR containing the filesystem classes on the classpath, along with all of its dependencies.
    +
    +### <a name="dataframes"></a>Example: DataFrames
    +
    +DataFrames can be created from and saved to object stores through the `read()` and `write()` methods.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.types.StringType
    +
    +val spark = SparkSession
    +    .builder
    +    .appName("DataFrames")
    +    .config(sparkConf)
    +    .getOrCreate()
    +import spark.implicits._
    +val numRows = 1000
    +
    +// generate test data
    +val sourceData = spark.range(0, numRows).select($"id".as("l"), $"id".cast(StringType).as("s"))
    +
    +// define the destination
    +val dest = "wasb://yourcontainer@youraccount.blob.core.windows.net/dataframes"
    +
    +// write the data
    +val orcFile = dest + "/data.orc"
    +sourceData.write.format("orc").save(orcFile)
    +
    +// now read it back
    +val orcData = spark.read.format("orc").load(orcFile)
    +
    +// finally, write the data as Parquet
    +orcData.write.format("parquet").save(dest + "/data.parquet")
    +spark.stop()
    +{% endhighlight %}
    +
    +### <a name="streaming"></a>Example: Spark Streaming and Cloud Storage
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.streaming._
    +
    +val sparkConf = new SparkConf()
    +val ssc = new StreamingContext(sparkConf, Milliseconds(5000))
    +try {
    +  val lines = ssc.textFileStream("s3a://bucket/incoming")
    +  val matches = lines.filter(_.endsWith("3"))
    +  matches.print()
    +  ssc.start()
    +  ssc.awaitTermination()
    +} finally {
    +  ssc.stop(true)
    +}
    +{% endhighlight %}
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +#### <a name="checkpointing"></a>Checkpointing Streams to object stores
    +
    +Streams should only be checkpointed to an object store considered compatible with
    +HDFS. As the checkpoint operation includes a `rename()` operation, checkpointing to
    +an object store can be so slow that streaming throughput collapses.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    --- End diff --
    
    Agree, I'd like to have more caveats documented. That said, I am not sure how recommendable this is at all; is it actually OK if you disable speculation and set all these knobs? Does that really count as 'working'? _shrug_ But I do favor putting the info out there, in the upstream Spark docs here.
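
A minimal sketch of the checkpointing advice quoted above, assuming an HDFS cluster is reachable at the (illustrative) address shown: monitor the object store path, but keep the checkpoint data on the cluster filesystem rather than in the store being monitored.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// Sketch: watch an s3a:// path for new files, but checkpoint to HDFS.
val ssc = new StreamingContext(new SparkConf(), Milliseconds(5000))
ssc.checkpoint("hdfs://namenode:8020/checkpoints/object-store-stream")  // illustrative HDFS path
val lines = ssc.textFileStream("s3a://bucket/incoming")
lines.print()
```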




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89476597
  
    --- Diff: pom.xml ---
    @@ -2558,6 +2660,26 @@
           </modules>
         </profile>
     
    +    <!--
    +      The cloud profile enables the cloud module.
    +      It does not declare the hadoop-* artifacts which
    +      the cloud module pulls in; these are delegated to
    +      the hadoop-x.y profiles, so permitting different
    +      hadoop versions to declare different include/exclude
    +      rules (especially transitive dependencies).
    +
    +      To use this profile, the hadoop-2.7 profile must also
    --- End diff --
    
    Hm, let's think about this. I guess all the modules are really named like `spark-*`, so module `spark-hadoop-cloud`? The profiles aren't, so it could be called `hadoop-cloud`? Whatever is consistent yet narrows this down a bit to being Hadoop-library-specific cloud support.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r104287049
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,158 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.2.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project Cloud Integration</name>
    +  <description>
    +    Contains support for cloud infrastructures, specifically the Hadoop JARs and
    +    transitive dependencies needed to interact with the infrastructures.
    +
    +    Any project which explicitly depends upon the spark-hadoop-cloud artifact will get the
    +    dependencies; the exact versions of which will depend upon the hadoop version Spark was compiled
    +    against.
    +
    +    The imports of transitive dependencies are managed to make them consistent
    +    with those of the Spark build.
    +
    +    WARNING: the signatures of methods in the AWS and Azure SDKs do change between
    --- End diff --
    
    Where does an end user need to act on this -- the profile is in theory setting all this up correctly right? 




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Any comments on the latest patch? Anyone?




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-202508415
  
    **[Test build #54333 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54333/consoleFull)** for PR 12004 at commit [`5e9cfbe`](https://github.com/apache/spark/commit/5e9cfbe30a5aff78e5b807a2d2cf38aa1a2b814d).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69480/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64428/consoleFull)** for PR 12004 at commit [`b25d497`](https://github.com/apache/spark/commit/b25d49701b4015b49efc6c89734301525d803524).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113699310
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries on its classpath, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +This module pulls in the <code>hadoop-openstack</code> dependency along with the other
    +object store connector JARs.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that leaving temporary files in the store can run up charges; delete
    +directories called `_temporary` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet I/O Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The `filterPushdown` option
    +enables the Parquet library to optimize data reads itself, potentially saving bandwidth.
    +
    +```
    +spark.hadoop.parquet.enable.summary-metadata false
    +spark.sql.parquet.mergeSchema false
    +spark.sql.parquet.filterPushdown true
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +### ORC I/O Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +```
    +spark.sql.orc.filterPushdown true
    +spark.sql.orc.splits.include.file.footer true
    +spark.sql.orc.cache.stripe.details.size 10000
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +The `filterPushdown` option enables the ORC library to optimize data reads itself,
    +potentially saving bandwidth.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file information, eliminating the need to reread this data.
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration
    +1. Alternatively, they can be programmatically added. *Important: never put authentication
    +secrets in source code. They will be compromised*.
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a Spark configuration file, do not share this file, including
    +attaching it to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Deprecated S3 client; only use for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</td>
    +    <td>
    +    Amazon's own S3 client; use only in Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</td>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</td>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</td>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store; it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</td>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</td>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</td>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the spark context or, in YARN clusters,
    +`core-site.xml`.
    +Versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment variables set in the environment of the user running `spark-submit`; these
    +will override any set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +is the sole S3 connector undergoing active maintenance within the Apache Hadoop project, and should be used wherever
    +possible.
    +
    +
    +**Tuning for performance:**
    +
    +For recent Hadoop versions, *when working with binary formats* (Parquet, ORC) use
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise random
    +```
    +
    +This reads from the object in blocks, which is efficient when seeking backwards as well as
    +forwards in a file, at the expense of making full file reads slower.
    +
    +When working with text formats (text, CSV), or any sequential read through an entire file
    +(including .gzip compressed data),
    +this "random" I/O policy should be disabled. This is the default, but can be done
    +explicitly:
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise normal
    +spark.hadoop.fs.s3a.readahead.range 157810688
    +```
    +
    +This optimizes the object read for sequential input; when there is a forward `seek()` call
    +within that readahead range, the client simply reads the data in the current HTTPS request, rather
    +than aborting it and starting a new one.
    +
    +
    +#### <a name="s3n"></a>S3 Native Client `s3n://`
    +
    +The ["S3N" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +was implemented in 2008 and has been widely used.
    +
    +While stable, S3N is essentially unmaintained, and deprecated in favor of S3A.
    +As well as being slower and more limited in authentication mechanisms, the
    +only maintenance it receives is for critical security issues.
    +
    +
    +#### <a name="emrs3"></a>Amazon EMR's S3 Client: `s3://`
    +
    +
    +In Amazon EMR, `s3://` is the URL scheme used to refer to
    +[Amazon's own filesystem client](https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/),
    +one that is closed-source.
    +
    +As EMR also maps `s3n://` to the same filesystem, using URLs with the `s3n://` scheme avoids
    +some confusion. Bear in mind, however, that Amazon's S3 client library is not the Apache one:
    +only Amazon can field bug reports related to it.
    +
    +To work with this data outside of EMR itself, use `s3a://` or `s3n://` instead.
    +
    +
    +#### <a name="asf_s3"></a>Obsolete: Apache Hadoop's S3 client, `s3://`
    +
    +Apache's own Hadoop releases (i.e. not EMR) use the URL scheme `s3://` to refer to a
    +deprecated inode-based filesystem implemented on top of S3.
    +This filesystem is obsolete, deprecated and has been dropped from Hadoop 3.x.
    +
    +*Important:* Do not use `s3://` URLs with Apache Spark except on Amazon EMR.
    +It is not the same as the Amazon EMR filesystem, and it is incompatible with all other applications.
    +
    +
    +### <a name="working_with_azure"></a>Working with Microsoft Azure Storage
    +
    +Azure support comes with the [`wasb` filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-azure/index.html).
    +
    +The Apache implementation is that used by Microsoft in Azure itself: it can be used
    +to access data in Azure as well as remotely. The object store itself is *consistent*, and
    +can be reliably used as the destination of queries.
    +
    +
    +### <a name="working_with_swift"></a>Working with OpenStack Swift
    +
    +
    +The OpenStack [`swift://` filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-openstack/index.html)
    +works with Swift object stores in private OpenStack installations and in public installations,
    +including Rackspace Cloud and IBM Softlayer.
    +
    +### <a name="working_with_google_cloud_storage"></a>Working with Google Cloud Storage
    +
    +[Google Cloud Storage](https://cloud.google.com/storage) is supported via Google's own
    +[GCS filesystem client](https://cloud.google.com/hadoop/google-cloud-storage-connector).
    +
    +
    +For use outside of Google Cloud, `gcs-connector.jar` must be manually downloaded and then added
    +to `$SPARK_HOME/jars`.
    +
    +
    +## <a name="cloud_stores_are_not_filesystems"></a>Important: Cloud Object Stores are Not Real Filesystems
    +
    +Object stores are not filesystems: they are not a hierarchical tree of directories and files.
    +
    +The Hadoop connectors offer a filesystem API view of the object stores, but underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html).
    +
    +While object stores can be used as the source and store
    +for persistent data, they cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS.
    +This is important to know, as the fact they are accessed with the same APIs can be misleading.
    +
    +### Directory Operations May be Slow and Non-atomic
    +
    +Directory rename and delete may be performed as a series of operations. Specifically, recursive
    +directory deletion may be implemented as "list the objects, delete them singly or in batches".
    +File and directory renames may be implemented as "copy all the objects" followed by the delete operation.
    +
    +1. The time to delete a directory depends on the number of files in the directory.
    +1. Directory deletion may fail partway through, leaving a partially deleted directory.
    +1. Directory renaming may fail part way through, leaving the destination directory containing some of the files
    +being renamed, with the source directory untouched.
    +1. The time to rename files and directories increases with the amount of data to rename.
    +1. If the rename is done on the client, the time to rename
    +each file will depend upon the bandwidth between client and the filesystem. The further away the client
    +is, the longer the rename will take.
    +1. Recursive directory listing can be very slow. This can slow down some parts of job submission
    +and execution.
    +
    +Because of these behaviours, committing of work by renaming directories is neither efficient nor
    +reliable. In Spark 1.6 and predecessors, there was a special output committer for Parquet,
    +the `org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter`
    +which bypasses the rename phase. However, as well as having major problems when used
    + with speculative execution enabled, it handled failures badly. For this reason, it
    +[was removed from Spark 2.0](https://issues.apache.org/jira/browse/SPARK-10063).
    +
    +*Critical:* speculative execution does not work against object
    +stores which do not support atomic directory renames. Your output may get
    +corrupted.
    +
    +*Warning:* even non-speculative execution is at risk of leaving the output of a job in an inconsistent
    +state if a "Direct" output committer is used and executors fail.
    +
    +### Data is Not Written Until the OutputStream's `close()` Operation.
    +
    +Data written to the object store is often buffered to a local file or stored in memory,
    --- End diff --
    
    I'm less clear that this section is relevant to a Spark app. Anything's possible, but it's rare that someone would write a stream directly(?). Dunno, it's not bad info; I just want to balance writing and maintaining this info in the Spark docs vs. pointing to other resources, with a summary of key points to know.
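
For the rare case where an application does write a stream directly, a hypothetical sketch of what that looks like against an object store follows (the bucket and path are illustrative); the point made in the quoted doc text is that the object only materializes, and a failed write may only surface, when `close()` returns.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: writing directly through the Hadoop FileSystem API against an object store.
// The data is buffered locally (or in memory) until close(), which performs the upload,
// so close() can be slow and is where failures tend to appear.
val conf = new Configuration()
val path = new Path("s3a://mybucket/reports/summary.txt")  // illustrative destination
val fs = FileSystem.get(path.toUri, conf)
val out = fs.create(path, true)
try {
  out.write("results\n".getBytes(StandardCharsets.UTF_8))
} finally {
  out.close()  // the object does not exist in the store until this completes
}
```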




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113749781
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,117 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0"
    +  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    +  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.2.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project Cloud Integration</name>
    +  <description>
    +    Contains support for cloud infrastructures, specifically the Hadoop JARs and
    +    transitive dependencies needed to interact with the infrastructures.
    +
    +    Any project which explicitly depends upon the spark-hadoop-cloud artifact will get the
    +    dependencies; the exact versions of which will depend upon the hadoop version Spark was compiled
    +    against.
    +
    +    The imports of transitive dependencies are managed to make them consistent
    +    with those of the Spark build.
    +
    +    WARNING: the signatures of methods in the AWS and Azure SDKs do change between
    +    versions: use exactly the same version with which the Hadoop JARs were
    +    built.
    +  </description>
    +  <properties>
    +    <sbt.project.name>hadoop-cloud</sbt.project.name>
    +  </properties>
    +
    +  <dependencies>
    +    <dependency>
    +      <groupId>org.apache.hadoop</groupId>
    +      <artifactId>hadoop-aws</artifactId>
    +      <scope>${hadoop.deps.scope}</scope>
    +    </dependency>
    +
    +    <dependency>
    +      <groupId>org.apache.hadoop</groupId>
    +      <artifactId>hadoop-openstack</artifactId>
    +      <scope>${hadoop.deps.scope}</scope>
    +    </dependency>
    +    <!--
    +    Add joda time to ensure that anything downstream which doesn't pull in spark-hive
    +    gets the correct joda time artifact, so it doesn't have auth failures on later Java 8 JVMs
    +    -->
    +    <dependency>
    +      <groupId>joda-time</groupId>
    +      <artifactId>joda-time</artifactId>
    +      <scope>${hadoop.deps.scope}</scope>
    +    </dependency>
    +    <!-- explicitly declare the jackson artifacts desired -->
    +    <dependency>
    +      <groupId>com.fasterxml.jackson.core</groupId>
    +      <artifactId>jackson-databind</artifactId>
    +      <scope>${hadoop.deps.scope}</scope>
    +    </dependency>
    +    <dependency>
    +      <groupId>com.fasterxml.jackson.core</groupId>
    +      <artifactId>jackson-annotations</artifactId>
    +      <scope>${hadoop.deps.scope}</scope>
    +    </dependency>
    +    <dependency>
    +      <groupId>com.fasterxml.jackson.dataformat</groupId>
    +      <artifactId>jackson-dataformat-cbor</artifactId>
    +      <scope>${hadoop.deps.scope}</scope>
    +    </dependency>
    +    <!--Explicit declaration to force in Spark version into transitive dependencies -->
    +    <dependency>
    +      <groupId>org.apache.httpcomponents</groupId>
    +      <artifactId>httpclient</artifactId>
    +      <scope>${hadoop.deps.scope}</scope>
    +    </dependency>
    +    <!--Explicit declaration to force in Spark version into transitive dependencies -->
    +    <dependency>
    +      <groupId>org.apache.httpcomponents</groupId>
    +      <artifactId>httpcore</artifactId>
    +      <scope>${hadoop.deps.scope}</scope>
    +    </dependency>
    +  </dependencies>
    +
    +  <profiles>
    +
    +    <profile>
    +      <id>hadoop-2.7</id>
    --- End diff --
    
    yes
    
    * 2.7 adds `hadoop-azure` for `wasb:`
    * 2.8 adds `hadoop-azure-datalake` for `adl:`
    
    There's going to be an aggregate POM in trunk, `hadoop-cloud-storage`, which declares all the transitive stuff, ideally stripping out cruft we don't need. That way, if new things go in, anything pulling that JAR shouldn't have to add new declarations. There's still the problem of transitive breakage of JARs (e.g. Jackson).




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113696820
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries on its classpath, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +This module pulls in the <code>hadoop-openstack</code> dependency along with the other
    +object store connector JARs.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    --- End diff --
    
    CC @vanzin I remember we were talking about v1 vs v2 but I don't remember what the outcome was.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #63789 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63789/consoleFull)** for PR 12004 at commit [`2001dd0`](https://github.com/apache/spark/commit/2001dd075cad851aa7bb958e7b2fcdc23268999d).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64503/consoleFull)** for PR 12004 at commit [`a00a555`](https://github.com/apache/spark/commit/a00a5554f9c3789510bedd24292c17b4c9a7efbd).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    The latest patch embraces the fact that 2.6 is the base Hadoop version, so the `hadoop-aws` JAR is always pulled in and its dependencies set up. One thing to bear in mind here is that the [Phase I fixes](https://issues.apache.org/jira/browse/HADOOP-11571) aren't in there, and the Hadoop 2.6 s3a absolutely must not be used in production, the big killers being:
    
    * [HADOOP-11570](https://issues.apache.org/jira/browse/HADOOP-11570): closing the stream reads to the EOF, which means every `seek()` can read up to 2x the file size.
    * [HADOOP-11584](https://issues.apache.org/jira/browse/HADOOP-11584): the block size returned in `getFileStatus()` == 0. That is bad because both Pig and Spark use that block size in partitioning, so they will split a file into single-byte partitions: a 20MB file becomes 2*10^7 tasks, each of which will open the file at byte 0, seek to its offset, then close(). As a result, 2*10^7 tasks each read up to 2 * 2*10^7 bytes (given the close-to-EOF behaviour above). This is generally considered "pathologically suboptimal". I've had to modify my downstream tests to recognise when the block size of a file == 0 and skip those tests.
    
    s3n will work; in 2.6 it moved to the hadoop-aws JAR, so this reinstates the functionality which was in Spark builds against Hadoop 2.2-2.5.
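
To make the HADOOP-11584 point concrete, a hypothetical guard of the kind described above might look like the following; the path is illustrative and the check is only a sketch.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: detect the HADOOP-11584 case, where getFileStatus() reports a block size of 0,
// before letting partitioning logic (or a test suite) loose on the file.
val conf = new Configuration()
val path = new Path("s3a://testbucket/data/sample.csv")  // illustrative path
val fs = FileSystem.get(path.toUri, conf)
val status = fs.getFileStatus(path)
if (status.getBlockSize == 0L) {
  // A zero block size would lead to pathological single-byte partitions; skip instead.
  println(s"Skipping $path: reported block size is 0")
}
```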




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-210418231
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55916/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63588/
    Test PASSed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-203480753
  
    **[Test build #54525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54525/consoleFull)** for PR 12004 at commit [`6beafb5`](https://github.com/apache/spark/commit/6beafb551551f3cfd28ff3f0f085b156c9e5fb38).




[GitHub] spark issue #12004: [SPARK-7481] [WiP] [build] Add spark-cloud module to pul...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #66182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66182/consoleFull)** for PR 12004 at commit [`ab293f4`](https://github.com/apache/spark/commit/ab293f40bb8fc470e53bab208a9785e9c3474a41).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-212548919
  
    **[Test build #56390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56390/consoleFull)** for PR 12004 at commit [`2fca815`](https://github.com/apache/spark/commit/2fca815198cd0cd578f9fa52408ac38f9142c2b4).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #71148 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71148/testReport)** for PR 12004 at commit [`c911ccb`](https://github.com/apache/spark/commit/c911ccb8b0d4218919fb3a6781ed5e19933fd7dc).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-202793164
  
    test failures are in hive; unrelated




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-203484639
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54525/
    Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-202517351
  
    Note that as this patch is playing with the maven build and the hadoop-2.6 and hadoop-2.7 profiles, the SparkQA builds aren't going to pick up on much here.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113950929
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores for data access.
    +through filesystem connectors implemented in Apache Hadoop or provided by third-parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be can be read or written through URLs which uses the name of the
    +object store client as the schema and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application neeeds the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include <code>hadoop-openstack</code> dependency, which can
    --- End diff --
    
    done throughout




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    (Continuing email thread): Yes, try `./dev/test-dependencies.sh --replace-manifest`




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113725132
  
    --- Diff: assembly/pom.xml ---
    @@ -226,5 +226,19 @@
             <parquet.deps.scope>provided</parquet.deps.scope>
           </properties>
         </profile>
    +
    +    <!--
    +     Pull in spark-hadoop-cloud and its associated JARs,
    +    -->
    +    <profile>
    +      <id>cloud</id>
    --- End diff --
    
    so org/apache/spark + hadoop-cloud? It'd cause too much confusion if any JAR created were thrown into a lib/ directory; you'd get
    
    ```
    hadoop-aws-2.8.1.jar
    spark-core-2.3.0
    hadoop-cloud-2.3.0
    ```
    & people would be trying to understand why the hadoop-* JAR was out of sync with the others, who to ping, etc.
    
    There's actually a [hadoop-cloud-storage project POM](https://github.com/apache/hadoop/blob/trunk/hadoop-cloud-storage-project/hadoop-cloud-storage/pom.xml) coming in hadoop-trunk to try and be a one-stop dependency for all cloud bindings (avoiding the ongoing "declare new dependencies per version"). The names are way too close.
    
    I'd had it as spark-cloud; you'd felt spark-hadoop-cloud was better. I can't think of anything else that would do, but I do think spark- is the string which should go at the front.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66505/
    Test FAILed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89315595
  
    --- Diff: pom.xml ---
    @@ -2558,6 +2660,26 @@
           </modules>
         </profile>
     
    +    <!--
    +      The cloud profile enables the cloud module.
    +      It does not declare the hadoop-* artifacts which
    +      the cloud module pulls in; these are delegated to
    +      the hadoop-x.y protocols, so permitting different
    +      hadoop versions to declare different include/exclude
    +      rules (especially transient dependencies).
    +
    +      To use this profile, the hadoop-2.7 profile must also
    --- End diff --
    
    you'd never want to have cloud without hadoop-2.7, but you may want to do hadoop-2.7 without cloud. That really mattered on spark-1.6, as it would make for a very large spark-assembly; in 2.x it just means more files in SPARK_HOME/jars and a bigger Spark tarball.
    
    I'd left it as an option, as with hive, mesos and yarn. However, if you do try to enable the cloud profile without hadoop-2.7 set, things won't build.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64206/consoleFull)** for PR 12004 at commit [`63cf84f`](https://github.com/apache/spark/commit/63cf84f17d79813404b03c259a52bccb2dcb5853).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68936/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-211546335
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56092/
    Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-212444385
  
    **[Test build #56360 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56360/consoleFull)** for PR 12004 at commit [`8845af0`](https://github.com/apache/spark/commit/8845af0be6e516b956f6acda222ddc0dd85ad17c).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89176373
  
    --- Diff: pom.xml ---
    @@ -2558,6 +2660,26 @@
           </modules>
         </profile>
     
    +    <!--
    +      The cloud profile enables the cloud module.
    +      It does not declare the hadoop-* artifacts which
    +      the cloud module pulls in; these are delegated to
    +      the hadoop-x.y protocols, so permitting different
    +      hadoop versions to declare different include/exclude
    +      rules (especially transient dependencies).
    +
    +      To use this profile, the hadoop-2.7 profile must also
    --- End diff --
    
    as long as you always want to include the cloud JARs on the classpath, yes. I'd kept it as a separate option to avoid forcing it in, but it would be a lot simpler if it were unified.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r114331264
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores for data access.
    +through filesystem connectors implemented in Apache Hadoop or provided by third-parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be can be read or written through URLs which uses the name of the
    +object store client as the schema and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application neeeds the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any spark work —be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +That they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and, unreliable in the presence of failures.
    --- End diff --
    
    fixed by cutting whole section




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #62260 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62260/consoleFull)** for PR 12004 at commit [`4bc668a`](https://github.com/apache/spark/commit/4bc668a7ddaf455b442a4d7fed170d19678d800c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #69578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69578/consoleFull)** for PR 12004 at commit [`bd50732`](https://github.com/apache/spark/commit/bd50732993e4e4ab98f3037bf29b83907f437f2a).




[GitHub] spark issue #12004: [SPARK-7481] [WiP] [build] Add spark-cloud module to pul...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #66504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66504/consoleFull)** for PR 12004 at commit [`764294b`](https://github.com/apache/spark/commit/764294bc3466e99d9743f922a05e0873a8f0f4b9).




[GitHub] spark issue #12004: [SPARK-7481] [WiP] [build] Add spark-cloud module to pul...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #66182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66182/consoleFull)** for PR 12004 at commit [`ab293f4`](https://github.com/apache/spark/commit/ab293f40bb8fc470e53bab208a9785e9c3474a41).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67991/
    Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214830134
  
    **[Test build #57003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57003/consoleFull)** for PR 12004 at commit [`4e4e941`](https://github.com/apache/spark/commit/4e4e9419179a218a2c0e9df9a58ce649512b5e3a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    @nchammas the AWS SDK you get will be in sync with hadoop-aws; you have to keep them in sync.
    
    What is more brittle is the transitive dependencies: httpclient, joda-time, jackson, etc., which is what recent patches in hadoop-aws have been trying to lock down for Hadoop's own consistency. That doesn't help Spark, as its choices are different. Hence the explicit declaration of the versions of things in the aws module, and the forced exclusion of the aws-sdk dependencies, because the Spark ones are declared closer to the root of the tree. Oddly enough, some of Hadoop's explicit version declarations make things worse, as they raise the declaration of some artifacts higher, and with maven's closest-version-wins policy, that breaks other things. The fault there is mvn's conflict-resolution policy of closeness over newness, for better or worse.
    
    A particular issue is `jackson-dataformat-cbor`, which is a jackson artifact not used/declared by the rest of Spark. Because it's not used elsewhere, there's no eviction of the one coming from the aws sdk, so packaging works, but linking fails at run time. This patch declares the JAR, using Spark's jackson version to fix it in place. Without this, you will see stack traces against some versions of hadoop-aws/aws-sdk-s3.
    
    Those problems aren't being checked in this module; grab https://github.com/steveloughran/spark-cloud-examples/tree/master for that.
    
    Actually, joda-time is only correctly picked up if you grab spark-hive. I've added a declaration of it here so that if someone pulls in spark-cloud without spark-hive they don't get auth errors against S3 caused by misformatting of timestamps on HTTP requests. Dependency management is an eternal conflict.
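
As a footnote to the `jackson-dataformat-cbor` point above, here is a hedged sketch of the kind of fail-fast check a downstream build could run; it assumes both jackson artifacts are on the classpath and is not part of this patch:

```scala
// Sketch only (not part of the patch): compare the jackson-databind and
// jackson-dataformat-cbor versions on the classpath, so a mismatch shows
// up as a clear error instead of a runtime linkage failure.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.dataformat.cbor.CBORFactory

object JacksonCborVersionCheck {
  def main(args: Array[String]): Unit = {
    val databind = new ObjectMapper().version()   // version of jackson-databind
    val cbor     = new CBORFactory().version()    // version of jackson-dataformat-cbor
    println(s"jackson-databind=$databind, jackson-dataformat-cbor=$cbor")
    require(databind.toString == cbor.toString,
      s"jackson-dataformat-cbor ($cbor) does not match jackson-databind ($databind); " +
        "pin it to Spark's jackson version")
  }
}
```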





[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #73430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73430/testReport)** for PR 12004 at commit [`94aa9ea`](https://github.com/apache/spark/commit/94aa9eaa41aa4217cea81e78e4b92a5f06349670).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #65396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65396/consoleFull)** for PR 12004 at commit [`ca3163d`](https://github.com/apache/spark/commit/ca3163ddc18c18cb626e50ab5ba1a650a2d5c8ea).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #74899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74899/testReport)** for PR 12004 at commit [`83d9368`](https://github.com/apache/spark/commit/83d936870ad0651fc2622593e53d3e31d7eb8d4b).




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113997967
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores for data access.
    +through filesystem connectors implemented in Apache Hadoop or provided by third-parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be can be read or written through URLs which uses the name of the
    +object store client as the schema and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application neeeds the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +
    +The Spark application should include <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any spark work —be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +That they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and, unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted in to the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path —not the number of *new* files, and that it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a worklow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files —which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    --- End diff --
    
    See [SPARK-4879](https://issues.apache.org/jira/browse/SPARK-4879) for this surfacing in Spark & S3, albeit before Josh added the coordination layer in Spark. The underlying committer was using that O(1) rename as an implicit co-ordinator amongst workers, which doesn't work when rename is a non-atomic call whose cost is O(data).
    
    Thinking about it, I could just be over-cautious based on what I know of the current s3a/FileOutputCommitter behaviour; other object stores may behave better (Azure has leases to manage some of this), and Josh's protocol should make the problem go away on all but the direct committer, which is disabled when speculate=true.
    
    so, 3 options
    
    1. Don't mention it at all
    2. Say "you should do this"
    3. Say "speculation may not work, consult your cloud storage provider/connector docs"?
    
    option 3 means that when you use the current Netflix staging committer, my S3Guard derivative, etc., all is well, and the same for anything that others provide. And I'll be doing my best to test all of this to see if I can create a problem. (The settings option 2 refers to are sketched below.)
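
For reference, the "recommended settings" in the quoted doc text (option 2 in the list above) amount to something like the following sketch; whether to recommend them at all is exactly what is being debated here, so treat this as illustrative only:

```scala
// Sketch of the settings the quoted "Recommended settings" section refers
// to: the "version 2" FileOutputCommitter algorithm plus speculative
// execution disabled. Illustrative only, not a blanket recommendation.
import org.apache.spark.SparkConf

object ObjectStoreWriteSettings {
  def conf(): SparkConf = new SparkConf()
    .set("spark.speculation", "false")
    .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
}
```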




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64503/
    Test PASSed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-202556003
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64290/consoleFull)** for PR 12004 at commit [`4601b0a`](https://github.com/apache/spark/commit/4601b0a38c6f794959d3760e457800909816fb5e).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64290/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63789/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #72155 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72155/testReport)** for PR 12004 at commit [`4a7b61d`](https://github.com/apache/spark/commit/4a7b61d14dc28d95730c3ff89b46971ba194734a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    >  Does a build of Spark + Hadoop 2.7 right now have no ability at all to read from S3 out of the box, or just not full / ideal support?
    
    No ability at all, as far as I can tell. People have to explicitly start their Spark session with a call to `--packages` like this:
    
    ```
    pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
    ```
    
    Without that, you get a `java.io.IOException: No FileSystem for scheme: s3n` if you try to read something from S3.
    
    I see the maintainer case for not wanting to have the default builds of Spark include AWS-specific stuff, and at the same time the end-user case for having that is just as clear.
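    
    If the module in this PR ships, the same thing could be expressed as a single dependency on it rather than hand-picking the hadoop-aws/aws-sdk pair. A sketch in sbt syntax, assuming the `spark-cloud_2.11` artifact name used in the docs here and a placeholder version:
    
    ```scala
    // Sketch of a downstream build.sbt; the artifact name and version are assumptions
    // based on this PR, not a published release.
    val sparkVersion = "2.2.0" // placeholder: whichever Spark release ships the module
    libraryDependencies += "org.apache.spark" % "spark-cloud_2.11" % sparkVersion
    ```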




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89134682
  
    --- Diff: docs/storage-openstack-swift.md ---
    @@ -19,41 +19,32 @@ Although not mandatory, it is recommended to configure the proxy server of Swift
     
     # Dependencies
     
    -The Spark application should include <code>hadoop-openstack</code> dependency.
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-cloud` module for the specific version of Spark used.
     For example, for Maven support, add the following to the <code>pom.xml</code> file:
     
     {% highlight xml %}
     <dependencyManagement>
       ...
       <dependency>
    -    <groupId>org.apache.hadoop</groupId>
    -    <artifactId>hadoop-openstack</artifactId>
    -    <version>2.3.0</version>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
       </dependency>
       ...
     </dependencyManagement>
     {% endhighlight %}
     
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
     
     # Configuration Parameters
     
     Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
    -There are two main categories of parameters that should to be configured: declaration of the
    -Swift driver and the parameters that are required by Keystone. 
    +Ther main category of parameters that should to be configured are the authentication parameters
    --- End diff --
    
    "The main category ... is ..."?




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #63789 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63789/consoleFull)** for PR 12004 at commit [`2001dd0`](https://github.com/apache/spark/commit/2001dd075cad851aa7bb958e7b2fcdc23268999d).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #66513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66513/consoleFull)** for PR 12004 at commit [`a216aed`](https://github.com/apache/spark/commit/a216aed9a009c41a90131a8d6de04bb54c504a17).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64487/
    Test PASSed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89134487
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,953 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-cloud` module for the specific version of Spark used.
    +
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
    +
    +### Basic Use
    +
    +
    +
    +To refer to a path in Amazon S3, use `s3a://` as the scheme (Hadoop 2.7+) or `s3n://` on older versions.
    +
    +{% highlight scala %}
    +sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()
    +{% endhighlight %}
    +
    +Similarly, an RDD can be saved to an object store via `saveAsTextFile()`
    +
    +
    +{% highlight scala %}
    +val numbers = sparkContext.parallelize(1 to 1000)
    +
    +// save to Amazon S3 (or compatible implementation)
    +numbers.saveAsTextFile("s3a://testbucket/counts")
    +
    +// Save to Azure Object store
    +numbers.saveAsTextFile("wasb://testbucket@example.blob.core.windows.net/counts")
    +
    +// save to an OpenStack Swift implementation
    +numbers.saveAsTextFile("swift://testbucket.openstack1/counts")
    +{% endhighlight %}
    +
    +That's essentially it: object stores can act as a source and destination of data, using exactly
    +the same APIs to load and save data as one uses to work with data in HDFS or other filesystems.
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +1. Have the JAR containing the filesystem classes on the classpath, along with all of its dependencies.
    +
    +### <a name="dataframes"></a>Example: DataFrames
    +
    +DataFrames can be created from and saved to object stores through the `read()` and `write()` methods.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.types.StringType
    +
    +val spark = SparkSession
    +    .builder
    +    .appName("DataFrames")
    +    .config(sparkConf)
    +    .getOrCreate()
    +import spark.implicits._
    +val numRows = 1000
    +
    +// generate test data
    +val sourceData = spark.range(0, numRows).select($"id".as("l"), $"id".cast(StringType).as("s"))
    +
    +// define the destination
    +val dest = "wasb://yourcontainer@youraccount.blob.core.windows.net/dataframes"
    +
    +// write the data
    +val orcFile = dest + "/data.orc"
    +sourceData.write.format("orc").save(orcFile)
    +
    +// now read it back
    +val orcData = spark.read.format("orc").load(orcFile)
    +
    +// finally, write the data as Parquet
    +orcData.write.format("parquet").save(dest + "/data.parquet")
    +spark.stop()
    +{% endhighlight %}
    +
    +### <a name="streaming"></a>Example: Spark Streaming and Cloud Storage
    +
    +Spark Streaming can monitor files added to object stores by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.streaming._
    +
    +val sparkConf = new SparkConf()
    +val ssc = new StreamingContext(sparkConf, Milliseconds(5000))
    +try {
    +  val lines = ssc.textFileStream("s3a://bucket/incoming")
    +  val matches = lines.filter(_.endsWith("3"))
    +  matches.print()
    +  ssc.start()
    +  ssc.awaitTermination()
    +} finally {
    +  ssc.stop(true)
    +}
    +{% endhighlight %}
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +#### <a name="checkpointing"></a>Checkpointing Streams to object stores
    +
    +Streams should only be checkpointed to an object store considered compatible with
    +HDFS. As the checkpoint operation includes a `rename()` operation, checkpointing to
    +an object store can be so slow that streaming throughput collapses.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as a replacement for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| Apache `s3a://` `s3n://`    | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; delete
    +directories called `"_temporary"` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet IO Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The predicate pushdown option
    +enables the Parquet library to skip unneeded columns, saving bandwidth.
    +
    +    spark.hadoop.parquet.enable.summary-metadata false
    +    spark.sql.parquet.mergeSchema false
    +    spark.sql.parquet.filterPushdown true
    +    spark.sql.hive.metastorePartitionPruning true
    +
    +### ORC IO Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +    spark.sql.orc.filterPushdown true
    +    spark.sql.orc.splits.include.file.footer true
    +    spark.sql.orc.cache.stripe.details.size 10000
    +    spark.sql.hive.metastorePartitionPruning true
    +
    +The predicate pushdown option enables the ORC library to skip unneeded columns, and use index
    +information to filter out parts of the file where it can be determined that no columns match the predicate.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file information, eliminating the need to reread this data.
    +
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration
    +1. Alternatively, they can be programmatically added. *Important: never put authentication
    +secrets in source code. They will be compromised*.
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a spark configuration file, do not share this file, including
    +attaching to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object Stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Deprecated S3 client; use only for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</td>
    +    <td>
    +    Amazon's own S3 client; use exclusively in Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</td>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</td>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</td>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store; it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</td>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</td>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</td>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the Spark context or, in YARN clusters,
    +`core-site.xml`.
    +Versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment variables set in the environment of the user running `spark-submit`; these
    +will override any set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +first shipped in Hadoop 2.6, and has been considered ready for production use since Hadoop 2.7.1.
    +
    +*The S3A connector is the sole S3 connector undergoing active maintenance in the Apache Hadoop project, and
    +should be used wherever possible.*
    +
    +**Classpath**
    --- End diff --
    
    While all of the info here is useful and, I assume, correct, some of these sections seem to be relevant to developers only and not end users. What of this do I need to know as an end user? Do I need to put this stuff on my classpath? It could be a dumb question, but I thought part of the idea was that this was being pulled in automatically, or had to be.
    
    I think the same applies to a lot of asides or back-story here. Some is good but the focus here has to be what it is Spark users need to know.
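    
    As a concrete illustration of the classpath question: one quick, hedged check an end user can run from `spark-shell` to see whether the S3A classes are actually present (the class name is Hadoop's):
    
    ```scala
    // If hadoop-aws and its AWS SDK dependency are on the classpath this resolves;
    // otherwise it throws ClassNotFoundException, the same missing-JAR condition that
    // surfaces as the "No FileSystem for scheme" errors mentioned earlier in this thread.
    Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
    ```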




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #66962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66962/consoleFull)** for PR 12004 at commit [`a2ed095`](https://github.com/apache/spark/commit/a2ed095b2f2277e477e49d3ca59a40ed98c331cc).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #62235 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62235/consoleFull)** for PR 12004 at commit [`e1a0907`](https://github.com/apache/spark/commit/e1a090787007c4c500a44cd88ed172b72c8dc3f0).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-220265269
  
    **[Test build #58856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58856/consoleFull)** for PR 12004 at commit [`4e37c7a`](https://github.com/apache/spark/commit/4e37c7a0e5509fee74f7af519dd81aa97de60762).




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-203484631
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #65446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65446/consoleFull)** for PR 12004 at commit [`c2b7d88`](https://github.com/apache/spark/commit/c2b7d885f91bb447ace8fbac427b2fdf9c84b4ef).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    I'm with Sean here -- we shouldn't create a module just because we might create something in the future. Why don't we create the module when there is something specific to add?





[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113950840
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    --- End diff --
    
    I'll do "All major cloud providers offer persistent data storage in *object stores*."




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #68936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68936/consoleFull)** for PR 12004 at commit [`9726d6c`](https://github.com/apache/spark/commit/9726d6c857439b870e08d49889a5da3c35a708e3).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64298/consoleFull)** for PR 12004 at commit [`300f14a`](https://github.com/apache/spark/commit/300f14a0e428b86b70bde43ae06a34db68043e87).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    @steveloughran - Is this message in the most recent build log critical?
    
    ```
    Spark's published dependencies DO NOT MATCH the manifest file (dev/spark-deps).
    To update the manifest file, run './dev/test-dependencies.sh --replace-manifest'.
    ```




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #68668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68668/consoleFull)** for PR 12004 at commit [`ac6b33f`](https://github.com/apache/spark/commit/ac6b33f35e3e0370d33116e6defa0e9baa0ec7f1).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #72155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72155/testReport)** for PR 12004 at commit [`4a7b61d`](https://github.com/apache/spark/commit/4a7b61d14dc28d95730c3ff89b46971ba194734a).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64428/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #67991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67991/consoleFull)** for PR 12004 at commit [`c9f3a0b`](https://github.com/apache/spark/commit/c9f3a0bbdb682fc151233ae46abe97da382a9594).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-216988386
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    (apologies for not replying; rebuilding a deceased laptop)
    
    My main concern is to have the ability to make spark releases which include the object store client libraries and a set of transitive JARs consistent with the version of Hadoop and spark used. It's that transitive problem, "consistent aws-sdk" and  "all jackson JARs in sync" which makes it hard and stops it being straightforward for any downstream project to pull in the right files themselves. I know this, because I have code that wants to do that, and, because I'm testing across so many different variants of the hadoop-* modules, I get to see these things first. Put differently: this patch compensates for the fact that whenever I bump the dependency version of the aws- or azure- JARs spark apps trying to work with these object stores break. 
    
    With the packaging set up, anyone can build Spark itself with the right JARs, including the (large) transitive AWS/Azure dependencies. Pretty much everybody publicly releasing derivatives of Spark is doing this; an explicit module delivers that same ability to the ASF code itself. It also lets the Spark project publish the spark-hadoop-cloud artifact to Maven Central, so that anyone building apps downstream via maven, sbt, ivy, ... can pick up the right dependencies. Trying to do that downstream is a very delicate piece of work.
    
    I'm about to push up a version which has just cut out the tests for transitive classloading; that means there's no source in the module, just the packaging. It is left to downstream code to validate the artifact declarations through whatever functional tests they have. It still generates a JAR file, only this is now empty. I could change it to being a POM artifact only, though that would commit the module to be a POM-only artifact forever.
    
    Now, once the code is stripped down to its minimum, there is one more deployment option: just adding the artifacts to an existing module (e.g. spark-core), again presumably with some profile to enable it. That's actually simpler: no new Spark artifacts, just tuned dependencies. spark-core already ships with the jets3t dependency, because hadoop-common (still) declares its dependency on it. Happy to do it that way if you want: all I care about is having the packaging and transitive dependencies available and consistent.
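    
    To make the "consistent transitive dependencies" point concrete, this is roughly what downstream builds have to hand-wire today without such a module (sbt syntax; every version number below is an illustrative placeholder, not a recommendation):
    
    ```scala
    // Sketch only: hadoop-aws must match the Hadoop version Spark was built against,
    // the AWS SDK must match that hadoop-aws release, and clashing jackson versions
    // often have to be excluded by hand.
    libraryDependencies ++= Seq(
      ("org.apache.hadoop" % "hadoop-aws" % "2.7.3")          // placeholder version
        .exclude("com.fasterxml.jackson.core", "jackson-databind"),
      "com.amazonaws" % "aws-java-sdk-s3" % "1.10.6"          // placeholder version
    )
    ```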





[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Surely hadoop-aws depends on the version of the AWS SDK it wants to?




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214785038
  
    **[Test build #56998 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56998/consoleFull)** for PR 12004 at commit [`8926acb`](https://github.com/apache/spark/commit/8926acb25e05f5d0748155c26262aae0d54fb3d0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #63984 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63984/consoleFull)** for PR 12004 at commit [`0d9f122`](https://github.com/apache/spark/commit/0d9f12250dd1b9f78acdac714b0bbeeda294cef5).




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89132111
  
    --- Diff: cloud/src/test/scala/org/apache/spark/cloud/AzureInstantiationSuite.scala ---
    @@ -0,0 +1,29 @@
    +/*
    --- End diff --
    
    What are these tests testing? just that some code is available on the classpath? To me it doesn't seem worth it, it's not what tests generally do in other modules. I think we can assume Maven works?
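    
    For context, such a suite would amount to little more than the following. This is a hedged sketch of the idea being questioned, not the actual test in the PR:
    
    ```scala
    import org.scalatest.FunSuite
    
    // Instantiation-style "test": it only proves the hadoop-azure connector class is on
    // the test classpath; it never talks to a real object store.
    class AzureInstantiationSketch extends FunSuite {
      test("NativeAzureFileSystem is on the classpath") {
        new org.apache.hadoop.fs.azure.NativeAzureFileSystem()
      }
    }
    ```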




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #62186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62186/consoleFull)** for PR 12004 at commit [`d609126`](https://github.com/apache/spark/commit/d609126dbd4da75d6001cf931b08927c7113a889).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63787/
    Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #62235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62235/consoleFull)** for PR 12004 at commit [`e1a0907`](https://github.com/apache/spark/commit/e1a090787007c4c500a44cd88ed172b72c8dc3f0).




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89202156
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,953 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
    +
    +### Basic Use
    +
    +
    +
    +To refer to a path in Amazon S3, use `s3a://` as the scheme (Hadoop 2.7+) or `s3n://` on older versions.
    +
    +{% highlight scala %}
    +sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()
    +{% endhighlight %}
    +
    +Similarly, an RDD can be saved to an object store via `saveAsTextFile()`
    +
    +
    +{% highlight scala %}
    +val numbers = sparkContext.parallelize(1 to 1000)
    +
    +// save to Amazon S3 (or compatible implementation)
    +numbers.saveAsTextFile("s3a://testbucket/counts")
    +
    +// Save to Azure Object store
    +numbers.saveAsTextFile("wasb://testbucket@example.blob.core.windows.net/counts")
    +
    +// save to an OpenStack Swift implementation
    +numbers.saveAsTextFile("swift://testbucket.openstack1/counts")
    +{% endhighlight %}
    +
    +That's essentially it: object stores can act as a source and destination of data, using exactly
    +the same APIs to load and save data as one uses to work with data in HDFS or other filesystems.
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +1. Have the JAR containing the filesystem classes on the classpath, along with all of its dependencies.
    +
    +### <a name="dataframes"></a>Example: DataFrames
    +
    +DataFrames can be created from and saved to object stores through the `read()` and `write()` methods.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    --- End diff --
    
    I had some examples; I removed them. I can put them back. But as you note, it's only a URL; the example is there to make it clear. I can just cut it back to "use it wherever you would any other path"
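    As a sketch of what such an example might look like (the bucket and object names below are placeholders, not real datasets), the object store URL simply goes wherever a filesystem path would:

    ```scala
    // Hypothetical sketch: an object store path is used exactly like an HDFS path.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cloud-dataframe-example").getOrCreate()

    // read CSV data from an S3 bucket via the s3a connector
    val df = spark.read.option("header", "true").csv("s3a://example-bucket/input/data.csv")

    // write the result back to the object store as Parquet
    df.write.parquet("s3a://example-bucket/output/data.parquet")
    ```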




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-203103180
  
    **[Test build #54454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54454/consoleFull)** for PR 12004 at commit [`72b3548`](https://github.com/apache/spark/commit/72b354855867c51c6426d72209d1b98d17796730).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DependencyCheckSuite extends SparkFunSuite `




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-212496714
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56360/
    Test PASSed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89308671
  
    --- Diff: pom.xml ---
    @@ -2558,6 +2660,26 @@
           </modules>
         </profile>
     
    +    <!--
    +      The cloud profile enables the cloud module.
    +      It does not declare the hadoop-* artifacts which
    +      the cloud module pulls in; these are delegated to
    +      the hadoop-x.y profiles, permitting different
    +      hadoop versions to declare different include/exclude
    +      rules (especially transitive dependencies).
    +
    +      To use this profile, the hadoop-2.7 profile must also
    --- End diff --
    
    I'm probably missing your point, but if the `cloud` profile requires the `hadoop-2.7` profile then when could it be set separately?




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113695789
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +passing a URL to the data to methods like `SparkContext.textFile()` to read data, or
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and, unreliable in the presence of failures.
    --- End diff --
    
    "and ," -> and




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-221668290
  
    **[Test build #59287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59287/consoleFull)** for PR 12004 at commit [`6b3812b`](https://github.com/apache/spark/commit/6b3812b24ca819997d6cd11c28a6d0b9a4402a2d).




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-212495960
  
    **[Test build #56360 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56360/consoleFull)** for PR 12004 at commit [`8845af0`](https://github.com/apache/spark/commit/8845af0be6e516b956f6acda222ddc0dd85ad17c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-210418227
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113979717
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +passing a URL to the data to methods like `SparkContext.textFile()` to read data, or
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and, unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted in to the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; delete
    +directories called `"_temporary"` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet I/O Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The `filterPushdown` option
    +enables the Parquet library to optimize data reads itself, potentially saving bandwidth.
    +
    +```
    +spark.hadoop.parquet.enable.summary-metadata false
    +spark.sql.parquet.mergeSchema false
    +spark.sql.parquet.filterPushdown true
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +### ORC I/O Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +```
    +spark.sql.orc.filterPushdown true
    +spark.sql.orc.splits.include.file.footer true
    +spark.sql.orc.cache.stripe.details.size 10000
    +spark.sql.hive.metastorePartitionPruning true
    +```
    +
    +The `filterPushdown` option enables the ORC library to optimize data reads itself,
    +potentially saving bandwidth.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file information, eliminating the need to reread this data.
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration
    +1. Alternatively, they can be programmatically added. *Important: never put authentication
    +secrets in source code. They will be compromised*.
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a Spark configuration file, do not share this file, including
    +attaching it to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Scheme</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</td>
    +    <td>
    +    Deprecated S3 client; only use for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</td>
    +    <td>
    +    Amazon's own S3 client; use it only in Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</td>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</td>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</td>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store; it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</td>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</td>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</td>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the spark context or, in YARN clusters,
    +`core-site.xml`.
    +Versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment variables set in the environment of the user running `spark-submit`; these
    +will override any set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +is the sole S3 connector undergoing active maintenance at the ASF, and should be used wherever
    +possible.
    +
    +
    +**Tuning for performance:**
    +
    +For recent Hadoop versions, *when working with binary formats* (Parquet, ORC) use
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise random
    +```
    +
    +This reads from the object in blocks, which is efficient when seeking backwards as well as
    +forwards in a file, at the expense of making full file reads slower.
    +
    +When working with text formats (text, CSV), or any sequential read through an entire file
    +(including .gzip compressed data),
    +this "random" I/O policy should be disabled. This is the default, but can be done
    +explicitly:
    +
    +```
    +spark.hadoop.fs.s3a.experimental.input.fadvise normal
    +spark.hadoop.fs.s3a.readahead.range 157810688
    +```
    +
    +This optimizes the object read for sequential input; when there is a forward `seek()` call
    +within the readahead range, the client simply reads the data in the current HTTPS request, rather than
    +aborting it and starting a new one.
    +
    +
    +#### <a name="s3n"></a>S3 Native Client `s3n://`
    +
    +The ["S3N" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +was implemented in 2008 and has been widely used.
    +
    +While stable, S3N is essentially unmaintained, and deprecated in favor of S3A.
    +As well as being slower and limited in authentication mechanisms, the
    +only maintenance it receives is for critical security issues.
    +
    +
    +#### <a name="emrs3"></a>Amazon EMR's S3 Client: `s3://`
    +
    +
    +In Amazon EMR, `s3://` is the URL scheme used to refer to
    +[Amazon's own filesystem client](https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/),
    +one that is closed-source.
    +
    +As EMR also maps `s3n://` to the same filesystem, using URLs with the `s3n://` scheme avoids
    +some confusion. Bear in mind, however, that Amazon's S3 client library is not the Apache one:
    +only Amazon can field bug reports related to it.
    +
    +To work with this data outside of EMR itself, use `s3a://` or `s3n://` instead.
    +
    +
    +#### <a name="asf_s3"></a>Obsolete: Apache Hadoop's S3 client, `s3://`
    +
    +Apache's own Hadoop releases (i.e. not EMR) use the URL `s3://` to refer to a
    +deprecated inode-based filesystem implemented on top of S3.
    +This filesystem is obsolete, deprecated, and has been dropped from Hadoop 3.x.
    +
    +*Important:* Do not use `s3://` URLs with Apache Spark except on Amazon EMR.
    +It is not the same as the Amazon EMR one and is incompatible with all other applications.
    +
    +
    +### <a name="working_with_azure"></a>Working with Microsoft Azure Storage
    +
    +Azure support comes with the [`wasb` filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-azure/index.html).
    +
    +The Apache implementation is that used by Microsoft in Azure itself: it can be used
    +to access data in Azure as well as remotely. The object store itself is *consistent*, and
    +can be reliably used as the destination of queries.
    +
    +
    +### <a name="working_with_swift"></a>Working with OpenStack Swift
    +
    +
    +The OpenStack [`swift://` filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-openstack/index.html)
    +works with Swift object stores in private OpenStack installations and in public installations,
    +including Rackspace Cloud and IBM Softlayer.
    +
    +### <a name="working_with_google_cloud_storage"></a>Working with Google Cloud Storage
    +
    +[Google Cloud Storage](https://cloud.google.com/storage) is supported via Google's own
    +[GCS filesystem client](https://cloud.google.com/hadoop/google-cloud-storage-connector).
    +
    +
    +For use outside of Google Cloud, `gcs-connector.jar` must be manually downloaded and then added
    +to `$SPARK_HOME/jars`.
    +
    +
    +## <a name="cloud_stores_are_not_filesystems"></a>Important: Cloud Object Stores are Not Real Filesystems
    +
    +Object stores are not filesystems: they are not a hierarchical tree of directories and files.
    +
    +The Hadoop filesystem APIs offer a filesystem view of the object stores, but underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html).
    +
    +While object stores can be used as the source and store
    +for persistent data, they cannot be used as a direct replacement for a cluster-wide filesystem such as HDFS.
    +This is important to know, as the fact they are accessed with the same APIs can be misleading.
    +
    +### Directory Operations May be Slow and Non-atomic
    +
    +Directory rename and delete may be performed as a series of operations. Specifically, recursive
    +directory deletion may be implemented as "list the objects, delete them singly or in batches".
    +File and directory renames may be implemented as "copy all the objects" followed by the delete operation.
    +
    +1. The time to delete a directory depends on the number of files in the directory.
    +1. Directory deletion may fail partway through, leaving a partially deleted directory.
    +1. Directory renaming may fail partway through, leaving the destination directory containing some of the files
    +being renamed and the source directory untouched.
    +1. The time to rename files and directories increases with the amount of data to rename.
    +1. If the rename is done on the client, the time to rename
    +each file will depend upon the bandwidth between client and the filesystem. The further away the client
    +is, the longer the rename will take.
    +1. Recursive directory listing can be very slow. This can slow down some parts of job submission
    +and execution.
    +
    +Because of these behaviours, committing work by renaming directories is neither efficient nor
    +reliable. In Spark 1.6 and its predecessors, there was a special output committer for Parquet,
    +`org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter`,
    +which bypassed the rename phase. However, as well as having major problems when used
    +with speculative execution enabled, it handled failures badly. For this reason, it
    +[was removed from Spark 2.0](https://issues.apache.org/jira/browse/SPARK-10063).
    +
    +*Critical*: speculative execution does not work against object
    +stores which do not support atomic directory renames. Your output may get
    +corrupted.
    +
    +*Warning*: even non-speculative execution is at risk of leaving the output of a job in an inconsistent
    +state if a "Direct" output committer is used and executors fail.
    +
    +### Data is Not Written Until the OutputStream's `close()` Operation.
    +
    +Data written to the object store is often buffered to a local file or stored in memory,
    --- End diff --
    
    I've cut it
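    For readers skimming the thread, here is a minimal sketch of applying the write-related settings quoted in the document above through SparkConf (the property names are copied from the quoted text; the values are the ones it recommends):

    ```scala
    import org.apache.spark.SparkConf

    // settings quoted in the cloud-integration document for writing to object stores
    val conf = new SparkConf()
      .set("spark.speculation", "false")
      .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .set("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
    ```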




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113962864
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read or write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +passing a URL to the data to methods like `SparkContext.textFile()` to read data, or
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    --- End diff --
    
    yes, I'm about to put out a stripped down document which doesn't do this table, instead just has a list of references at the bottom "for further reading"




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-202556008
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54333/
    Test FAILed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113950943
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    --- End diff --
    
    done




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113749428
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,158 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.2.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project Cloud Integration</name>
    +  <description>
    +    Contains support for cloud infrastructures, specifically the Hadoop JARs and
    +    transitive dependencies needed to interact with the infrastructures.
    +
    +    Any project which explicitly depends upon the spark-hadoop-cloud artifact will get the
    +    dependencies, the exact versions of which depend upon the Hadoop version Spark was compiled
    +    against.
    +
    +    The imports of transitive dependencies are managed to make them consistent
    +    with those of the Spark build.
    +
    +    WARNING: the signatures of methods in the AWS and Azure SDKs do change between
    --- End diff --
    
    Cutting back to the first line, it can be covered in docs. 
    
    One option with the docs is to trim them back and say "consult the [Hadoop documentation](http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#S3A) for object store setup", and I can be more explicit there on version pain.
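    For illustration, the object store setup being deferred to the Hadoop documentation usually amounts to a couple of `spark.hadoop.fs.s3a.*` properties; a hedged sketch follows (the environment variable names are an assumption, and secrets should never be hard-coded in source):

    ```scala
    import org.apache.spark.SparkConf

    // pass S3A credentials through to the Hadoop configuration via spark.hadoop.* properties;
    // values are read from the environment rather than being written into source code
    val conf = new SparkConf()
      .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    ```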




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73430/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #69578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69578/consoleFull)** for PR 12004 at commit [`bd50732`](https://github.com/apache/spark/commit/bd50732993e4e4ab98f3037bf29b83907f437f2a).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r82502588
  
    --- Diff: cloud/src/main/scala/org/apache/spark/cloud/s3/S3AConstants.scala ---
    @@ -0,0 +1,75 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.cloud.s3
    +
    +/**
    + * S3A constants. Different Hadoop versions have an incomplete set of these; keeping them
    + * in source here ensures that there are no compile/link problems.
    + */
    +object S3AConstants {
    +  val ACCESS_KEY = "fs.s3a.access.key"
    +  val SECRET_KEY = "fs.s3a.secret.key"
    +  val AWS_CREDENTIALS_PROVIDER = "fs.s3a.aws.credentials.provider"
    +  val ANONYMOUS_CREDENTIALS = "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
    +  val SESSION_TOKEN = "fs.s3a.session.token"
    +  val MAXIMUM_CONNECTIONS = "fs.s3a.connection.maximum"
    +  val SECURE_CONNECTIONS = "fs.s3a.connection.ssl.enabled"
    +  val ENDPOINT = "fs.s3a.endpoint"
    +  val PATH_STYLE_ACCESS = "fs.s3a.path.style.access"
    +  val PROXY_HOST = "fs.s3a.proxy.host"
    +  val PROXY_PORT = "fs.s3a.proxy.port"
    +  val PROXY_USERNAME = "fs.s3a.proxy.username"
    +  val PROXY_PASSWORD = "fs.s3a.proxy.password"
    +  val PROXY_DOMAIN = "fs.s3a.proxy.domain"
    +  val PROXY_WORKSTATION = "fs.s3a.proxy.workstation"
    +  val MAX_ERROR_RETRIES = "fs.s3a.attempts.maximum"
    +  val ESTABLISH_TIMEOUT = "fs.s3a.connection.establish.timeout"
    +  val SOCKET_TIMEOUT = "fs.s3a.connection.timeout"
    +  val MAX_PAGING_KEYS = "fs.s3a.paging.maximum"
    +  val MAX_THREADS = "fs.s3a.threads.max"
    +  val KEEPALIVE_TIME = "fs.s3a.threads.keepalivetime"
    +  val MAX_TOTAL_TASKS = "fs.s3a.max.total.tasks"
    +  val MULTIPART_SIZE = "fs.s3a.multipart.size"
    +  val MIN_PERMITTED_MULTIPART_SIZE = 5 * (1024 * 1024)
    +  val MIN_MULTIPART_THRESHOLD = "fs.s3a.multipart.threshold"
    +  val ENABLE_MULTI_DELETE = "fs.s3a.multiobjectdelete.enable"
    +  val BUFFER_DIR = "fs.s3a.buffer.dir"
    +  val FAST_UPLOAD = "fs.s3a.fast.upload"
    +  val FAST_BUFFER_SIZE = "fs.s3a.fast.buffer.size"
    +  val PURGE_EXISTING_MULTIPART = "fs.s3a.multipart.purge"
    +  val PURGE_EXISTING_MULTIPART_AGE = "fs.s3a.multipart.purge.age"
    +  val SERVER_SIDE_ENCRYPTION_ALGORITHM = "fs.s3a.server-side-encryption-algorithm"
    +  val SERVER_SIDE_ENCRYPTION_AES256 = "AES256"
    +  val SIGNING_ALGORITHM = "fs.s3a.signing-algorithm"
    +  val BLOCK_SIZE = "fs.s3a.block.size"
    +  val FS_S3A = "s3a"
    +  val USER_AGENT_PREFIX = "fs.s3a.user.agent.prefix"
    +  val READAHEAD_RANGE = "fs.s3a.readahead.range"
    +  val INPUT_FADVISE = "fs.s3a.experimental.input.fadvise"
    +  val SEQUENTIAL_IO = "sequential"
    +  val NORMAL_IO = "normal"
    +  val RANDOM_IO = "random"
    +
    +  /**
    +   * Default source of a public multi-MB CSV file.
    +   */
    +  val S3A_CSV_PATH_DEFAULT = "s3a://landsat-pds/scene_list.gz"
    --- End diff --
    
    Good question; it could be either. I've been using it as a test path for reading a 20MB CSV file which costs $0 to use; the tests are set up so that people testing against their own S3 implementation can switch to a different path. But as you note, it's an example, so the name should be more appropriate. Will fix.
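    
    For illustration, here is a minimal sketch of how that test path could be made configurable,
    using the constants from the diff above. The override property name
    `spark.cloud.test.csvfile.path` is a hypothetical example, and the anonymous-credentials
    provider only works because the landsat-pds bucket is public:
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.cloud.s3.S3AConstants
    
    // Sketch only: count the lines of a public CSV file over s3a:// without credentials.
    val conf = new SparkConf().setAppName("S3ACSVRead").setMaster("local[2]")
    // allow the source path to be overridden; fall back to the landsat scene list
    val csvPath = conf.get("spark.cloud.test.csvfile.path", S3AConstants.S3A_CSV_PATH_DEFAULT)
    // options with the "spark.hadoop." prefix are propagated to the Hadoop configuration
    conf.set("spark.hadoop." + S3AConstants.AWS_CREDENTIALS_PROVIDER,
      S3AConstants.ANONYMOUS_CREDENTIALS)
    
    val sc = new SparkContext(conf)
    try {
      println(s"line count: ${sc.textFile(csvPath).count()}")
    } finally {
      sc.stop()
    }
    ```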




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-210417764
  
    **[Test build #55916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55916/consoleFull)** for PR 12004 at commit [`105de0b`](https://github.com/apache/spark/commit/105de0b8e80aabd93efbf693083ef72a37e8791d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214801641
  
    Note that as this module only builds on Hadoop >= 2.6, Jenkins won't be compiling it. The tests are designed to skip themselves if no configuration file pointing at cloud infrastructure has been provided.
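    
    As a minimal sketch of that skip-if-unconfigured pattern (the suite name and the
    `cloud.test.configuration.file` system property below are hypothetical examples, not
    necessarily the names used in this patch):
    
    ```scala
    import java.io.File
    
    import org.scalatest.FunSuite
    
    // Sketch only: cancel cloud tests when no configuration file has been supplied.
    class S3ARoundTripSuite extends FunSuite {
    
      private val configFile = sys.props.get("cloud.test.configuration.file")
    
      test("round trip a file through the object store") {
        // assume() cancels the test, rather than failing it, when the condition is false
        assume(configFile.exists(f => new File(f).exists()),
          "no cloud test configuration supplied")
        // ... load the credentials from configFile.get and run the real test here ...
      }
    }
    ```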




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113967945
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be pulled in by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be pulled in by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, 
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    --- End diff --
    
    I'm going to change it to "consult the docs of the connector and the object store". That allows for things to change over time without the Spark docs changing, and makes it clear to the reader that this may be very dangerous.
    
    It is a problem to use the result of a job, because the commit process uses the object store as the location for uncommitted work. Both the task and job commits rename() their output: task attempt output is promoted to the job, the job to the destination, all of which is done by list + copy + delete. If you don't get that listing right, you don't copy everything: task output doesn't get promoted to job output *and nothing even notices*. Or, if there is only one output, the listing of the parent directory returns a 404, as there is no evidence the directory exists yet. At least there the job fails; look at the final stack trace in HADOOP-11487 for that surfacing in Spark SQL. 
    
    *The more I know about the standard commit algorithm, the S3 consistency model & what S3N/S3A do, the more surprised I am that it has ever worked with S3.*
    
    The Netflix staging committer addresses this by using HDFS to manage (consistently) all the data about the ongoing job: when each task commits, it does an uncommitted multipart put to the final destination dir, saving all the pending commit metadata to a file in HDFS and relying on the usual v1 commit algorithm to commit/abort that until the final job commit. The job commit then reads in all the summary files of the successfully committed tasks and completes their writes with a single POST each. This means we can avoid both the rename and the need for consistency on the destination FS, at least for a single query. Chaining queries still needs consistency (S3mper, etc.), or at least a long enough delay for the changes of the first query to propagate. 
    
    Regarding distcp: you do the work locally and then upload. As long as DistCp doesn't try to list/manipulate the uploaded file, there are no issues. HADOOP-13145 stops it doing that; before it went in, problems did surface.
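    
    To make the rename-based promotion concrete, here is a deliberately simplified sketch of
    that "v1" commit flow (the paths and the single-level merge are illustrative only; the
    real committer recurses and handles pre-existing directories):
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    
    // Simplified sketch of the rename-based commit flow. On HDFS each rename() is a
    // cheap metadata operation; an S3 connector has to implement it as list + copy +
    // delete, so a missed or inconsistent listing means silently missing output.
    object RenameCommitSketch {
      def main(args: Array[String]): Unit = {
        val dest = new Path("s3a://testbucket/output") // hypothetical destination
        val fs: FileSystem = dest.getFileSystem(new Configuration())
    
        // task commit: promote this attempt's directory to the committed-task directory
        val attemptDir = new Path(dest, "_temporary/0/_temporary/attempt_000000_0")
        val taskDir = new Path(dest, "_temporary/0/task_000000")
        fs.rename(attemptDir, taskDir)
    
        // job commit: list the committed tasks and promote each of their files to the
        // destination; anything the listing misses is simply never promoted
        for (task <- fs.listStatus(new Path(dest, "_temporary/0"));
             file <- fs.listStatus(task.getPath)) {
          fs.rename(file.getPath, new Path(dest, file.getPath.getName))
        }
        fs.delete(new Path(dest, "_temporary"), true)
      }
    }
    ```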





[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72155/
    Test PASSed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89133312
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,953 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be pulled in by including the `spark-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be pulled in by including the `spark-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
    +
    +### Basic Use
    +
    +
    +
    +To refer to a path in Amazon S3, use `s3a://` as the scheme (Hadoop 2.7+) or `s3n://` on older versions.
    +
    +{% highlight scala %}
    +sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()
    +{% endhighlight %}
    +
    +Similarly, an RDD can be saved to an object store via `saveAsTextFile()`
    +
    +
    +{% highlight scala %}
    +val numbers = sparkContext.parallelize(1 to 1000)
    +
    +// save to Amazon S3 (or compatible implementation)
    +numbers.saveAsTextFile("s3a://testbucket/counts")
    +
    +// Save to Azure Object store
    +numbers.saveAsTextFile("wasb://testbucket@example.blob.core.windows.net/counts")
    +
    +// save to an OpenStack Swift implementation
    +numbers.saveAsTextFile("swift://testbucket.openstack1/counts")
    +{% endhighlight %}
    +
    +That's essentially it: object stores can act as a source and destination of data, using exactly
    +the same APIs to load and save data as one uses to work with data in HDFS or other filesystems.
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +1. Have the JAR containing the filesystem classes on the classpath, along with all of its dependencies.
    +
    +### <a name="dataframes"></a>Example: DataFrames
    +
    +DataFrames can be created from and saved to object stores through the `read()` and `write()` methods.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    --- End diff --
    
    Ideally these are in compilable example files, and are included with `include_example`. I think these examples don't actually entail a direct dependency on a cloud SDK or `spark-cloud`? That should be possible then.
    
    Alternatively ... honestly, does this need an example? The only thing to know about is a different URI scheme, right?




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-212598141
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56390/
    Test PASSed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-210383201
  
    **[Test build #55916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55916/consoleFull)** for PR 12004 at commit [`105de0b`](https://github.com/apache/spark/commit/105de0b8e80aabd93efbf693083ef72a37e8791d).




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-204966316
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54803/
    Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-203484599
  
    **[Test build #54525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54525/consoleFull)** for PR 12004 at commit [`6beafb5`](https://github.com/apache/spark/commit/6beafb551551f3cfd28ff3f0f085b156c9e5fb38).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-212496709
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89340198
  
    --- Diff: docs/storage-openstack-swift.md ---
    @@ -19,41 +19,32 @@ Although not mandatory, it is recommended to configure the proxy server of Swift
     
     # Dependencies
     
    -The Spark application should include <code>hadoop-openstack</code> dependency.
    +The Spark application should include <code>hadoop-openstack</code> dependency, which can
    +be done by including the `spark-cloud` module for the specific version of spark used.
     For example, for Maven support, add the following to the <code>pom.xml</code> file:
     
     {% highlight xml %}
     <dependencyManagement>
       ...
       <dependency>
    -    <groupId>org.apache.hadoop</groupId>
    -    <artifactId>hadoop-openstack</artifactId>
    -    <version>2.3.0</version>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
       </dependency>
       ...
     </dependencyManagement>
     {% endhighlight %}
     
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
     
     # Configuration Parameters
     
     Create <code>core-site.xml</code> and place it inside Spark's <code>conf</code> directory.
    -There are two main categories of parameters that should to be configured: declaration of the
    -Swift driver and the parameters that are required by Keystone. 
    +Ther main category of parameters that should to be configured are the authentication parameters
    --- End diff --
    
    fixed




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r98448893
  
    --- Diff: cloud/src/test/scala/org/apache/spark/cloud/AzureInstantiationSuite.scala ---
    @@ -0,0 +1,29 @@
    +/*
    --- End diff --
    
    They don't detect much; they're just a simple sanity check that the dependencies get pulled in. They're easy to cut; it just means that all such tests move downstream.
    
    My downstream tests [are more rigorous](https://github.com/steveloughran/spark-cloud-examples/blob/master/cloud-examples/src/test/scala/com/hortonworks/spark/cloud/s3/S3DependencyCheckSuite.scala), and can also be executed under Jenkins without credentials, so there's nothing lost in cutting these, except that PRs to the Spark project won't surface the problems directly.
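    
    For context, such an instantiation check can be as small as the sketch below; the suite
    name is illustrative, and the classes are the connector filesystem implementations that
    hadoop-aws and hadoop-azure provide:
    
    ```scala
    import org.scalatest.FunSuite
    
    // Sketch only: prove the connector classes and their transitive dependencies are on
    // the classpath by loading and instantiating them.
    class ConnectorInstantiationSuite extends FunSuite {
    
      private def instantiate(classname: String): Unit = {
        // constructing an instance forces the class and the types it references to
        // resolve, surfacing any missing SDK JARs as errors
        Class.forName(classname).newInstance()
      }
    
      test("instantiate S3AFileSystem") {
        instantiate("org.apache.hadoop.fs.s3a.S3AFileSystem")
      }
    
      test("instantiate NativeAzureFileSystem") {
        instantiate("org.apache.hadoop.fs.azure.NativeAzureFileSystem")
      }
    }
    ```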




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #65446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65446/consoleFull)** for PR 12004 at commit [`c2b7d88`](https://github.com/apache/spark/commit/c2b7d885f91bb447ace8fbac427b2fdf9c84b4ef).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89402090
  
    --- Diff: pom.xml ---
    @@ -2558,6 +2660,26 @@
           </modules>
         </profile>
     
    +    <!--
    +      The cloud profile enables the cloud module.
    +      It does not declare the hadoop-* artifacts which
    +      the cloud module pulls in; these are delegated to
    +      the hadoop-x.y protocols, so permitting different
    +      hadoop versions to declare different include/exclude
    +      rules (especially transient dependencies).
    +
    +      To use this profile, the hadoop-2.7 profile must also
    --- End diff --
    
    Profile, hadoop-cloud, module: should they all be the same?




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test failure due to new artifacts
    ```
    +++ b/dev/pr-deps/spark-deps-hadoop-2.7
    @@ -16,8 +16,6 @@ arpack_combined_all-0.1.jar
     avro-1.7.7.jar
     avro-ipc-1.7.7.jar
     avro-mapred-1.7.7-hadoop2.jar
    -aws-java-sdk-1.7.4.jar
    -azure-storage-2.0.0.jar
     base64-2.3.8.jar
     bcprov-jdk15on-1.51.jar
     bonecp-0.8.0.RELEASE.jar
    @@ -63,8 +61,6 @@ guice-3.0.jar
     guice-servlet-3.0.jar
     hadoop-annotations-2.7.3.jar
     hadoop-auth-2.7.3.jar
    -hadoop-aws-2.7.3.jar
    -hadoop-azure-2.7.3.jar
     hadoop-client-2.7.3.jar
     hadoop-common-2.7.3.jar
     hadoop-hdfs-2.7.3.jar
    @@ -73,7 +69,6 @@ hadoop-mapreduce-client-common-2.7.3.jar
     hadoop-mapreduce-client-core-2.7.3.jar
     hadoop-mapreduce-client-jobclient-2.7.3.jar
     hadoop-mapreduce-client-shuffle-2.7.3.jar
    -hadoop-hadoop-openstack-2.7.3.jar
     hadoop-yarn-api-2.7.3.jar
     hadoop-yarn-client-2.7.3.jar
     hadoop-yarn-common-2.7.3.jar
    ```




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113749971
  
    --- Diff: pom.xml ---
    @@ -621,6 +621,11 @@
             <version>${fasterxml.jackson.version}</version>
           </dependency>
           <dependency>
    +        <groupId>com.fasterxml.jackson.dataformat</groupId>
    +        <artifactId>jackson-dataformat-cbor</artifactId>
    --- End diff --
    
    Yes, keeping jackson in sync is a key breakage point. Declaring it in the root POM doesn't add it everywhere, it just declares it so that the cloud POM can exclude the one which comes via the `aws-java-sdk-s3` dependency JARs and pick up the one used in Spark. The (later) Spark one is compatible with the one the AWS SDK depends on, so moving up works... it's just that all the jackson bits need to be in sync, and there's no way in Maven or Ivy to declare that fact.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89364107
  
    --- Diff: cloud/src/test/scala/org/apache/spark/cloud/AzureInstantiationSuite.scala ---
    @@ -0,0 +1,29 @@
    +/*
    --- End diff --
    
    Hm, more tests are usually good though if it's just verifying the classpath, it may not be worth it, especially in the name of keeping this simple.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64428/consoleFull)** for PR 12004 at commit [`b25d497`](https://github.com/apache/spark/commit/b25d49701b4015b49efc6c89734301525d803524).




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-212597555
  
    **[Test build #56390 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56390/consoleFull)** for PR 12004 at commit [`2fca815`](https://github.com/apache/spark/commit/2fca815198cd0cd578f9fa52408ac38f9142c2b4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #62537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62537/consoleFull)** for PR 12004 at commit [`cb07c1d`](https://github.com/apache/spark/commit/cb07c1d7b79944059e477b0b615ce061b08cef00).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    I've pointed this out before, and again: FWIW I really don't see what this pull request is trying to accomplish...




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89340299
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,953 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, which can
    +be pulled in by including the `spark-cloud` module for the specific version of Spark used.
    +
    +The Spark application should include the <code>hadoop-openstack</code> dependency, which can
    +be pulled in by including the `spark-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
    +
    +### Basic Use
    +
    +
    +
    +To refer to a path in Amazon S3, use `s3a://` as the scheme (Hadoop 2.7+) or `s3n://` on older versions.
    +
    +{% highlight scala %}
    +sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()
    +{% endhighlight %}
    +
    +Similarly, an RDD can be saved to an object store via `saveAsTextFile()`
    +
    +
    +{% highlight scala %}
    +val numbers = sparkContext.parallelize(1 to 1000)
    +
    +// save to Amazon S3 (or compatible implementation)
    +numbers.saveAsTextFile("s3a://testbucket/counts")
    +
    +// Save to Azure Object store
    +numbers.saveAsTextFile("wasb://testbucket@example.blob.core.windows.net/counts")
    +
    +// save to an OpenStack Swift implementation
    +numbers.saveAsTextFile("swift://testbucket.openstack1/counts")
    +{% endhighlight %}
    +
    +That's essentially it: object stores can act as a source and destination of data, using exactly
    +the same APIs to load and save data as one uses to work with data in HDFS or other filesystems.
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +1. Have the JAR containing the filesystem classes on the classpath, along with all of its dependencies.
    +
    +### <a name="dataframes"></a>Example: DataFrames
    +
    +DataFrames can be created from and saved to object stores through the `read()` and `write()` methods.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.types.StringType
    +
    +val spark = SparkSession
    +    .builder
    +    .appName("DataFrames")
    +    .config(sparkConf)
    +    .getOrCreate()
    +import spark.implicits._
    +val numRows = 1000
    +
    +// generate test data
    +val sourceData = spark.range(0, numRows).select($"id".as("l"), $"id".cast(StringType).as("s"))
    +
    +// define the destination
    +val dest = "wasb://yourcontainer@youraccount.blob.core.windows.net/dataframes"
    +
    +// write the data
    +val orcFile = dest + "/data.orc"
    +sourceData.write.format("orc").save(orcFile)
    +
    +// now read it back
    +val orcData = spark.read.format("orc").load(orcFile)
    +
    +// finally, write the data as Parquet
    +orcData.write.format("parquet").save(dest + "/data.parquet")
    +spark.stop()
    +{% endhighlight %}
    +
    +### <a name="streaming"></a>Example: Spark Streaming and Cloud Storage
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.streaming._
    +
    +val sparkConf = new SparkConf()
    +val ssc = new StreamingContext(sparkConf, Milliseconds(5000))
    +try {
    +  val lines = ssc.textFileStream("s3a://bucket/incoming")
    +  val matches = lines.filter(_.endsWith("3"))
    +  matches.print()
    +  ssc.start()
    +  ssc.awaitTermination()
    +} finally {
    +  ssc.stop(true)
    +}
    +{% endhighlight %}
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +#### <a name="checkpointing"></a>Checkpointing Streams to object stores
    +
    +Streams should only be checkpointed to an object store considered compatible with
    +HDFS. As the checkpoint operation includes a `rename()` operation, checkpointing to
    +an object store can be so slow that streaming throughput collapses.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| Apache `s3a://` `s3n://`    | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the data for follow-on work. The final results can
    +be persisted in to the object store using `distcp`.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; delete
    +directories called `"_temporary"` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    +waiting for the scheduler to find a node close to the data.
    +
    +```xml
    +  <property>
    +    <name>yarn.scheduler.capacity.node-locality-delay</name>
    +    <value>0</value>
    +  </property>
    +```
    +
    +This has to be set in the YARN cluster configuration, not in the Spark configuration.
    +
    +### Parquet IO Settings
    +
    +For optimal performance when reading files saved in the Apache Parquet format,
    +read and write operations must be minimized, including generation of summary metadata,
    +and coalescing metadata from multiple files. The predicate pushdown option
    +enables the Parquet library to skip unneeded columns, saving bandwidth.
    +
    +    spark.hadoop.parquet.enable.summary-metadata false
    +    spark.sql.parquet.mergeSchema false
    +    spark.sql.parquet.filterPushdown true
    +    spark.sql.hive.metastorePartitionPruning true
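    +
    +With these options set, a query such as the following sketch lets the Parquet reader apply the
    +filter before data is fetched; it assumes a `SparkSession` named `spark`, and the bucket, table
    +layout and column names are purely illustrative.
    +
    +```scala
    +// The filter and column projection are pushed down into the Parquet reader,
    +// so row groups and columns that cannot match are never downloaded.
    +val events = spark.read.parquet("s3a://data-bucket/events")
    +val recent = events.filter(events("year") === 2016).select("id", "timestamp")
    +recent.show()
    +```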
    +
    +### ORC IO Settings
    +
    +For optimal performance when reading files saved in the Apache ORC format,
    +read and write operations must be minimized. Here are the options to achieve this.
    +
    +
    +    spark.sql.orc.filterPushdown true
    +    spark.sql.orc.splits.include.file.footer true
    +    spark.sql.orc.cache.stripe.details.size 10000
    +    spark.sql.hive.metastorePartitionPruning true
    +
    +The predicate pushdown option enables the ORC library to skip unneeded columns, and to use index
    +information to filter out parts of the file where it can be determined that no rows match the predicate.
    +
    +The `spark.sql.orc.splits.include.file.footer` option means that the ORC file footer information
    +is passed around with the file split information, eliminating the need to reread this data.
    +
    +
    +## <a name="authenticating"></a>Authenticating with Object Stores
    +
    +Apart from the special case of public read-only data, all object stores
    +require callers to authenticate themselves.
    +To do this, the Spark context must be configured with the authentication
    +details of the object store.
    +
    +1. In a YARN cluster, this may be done automatically in the `core-site.xml` file.
    +1. When Spark is running in cloud infrastructure (for example, on Amazon EC2, Google Cloud or
    +Microsoft Azure), the authentication details may be automatically derived from information
    +available to the VM.
    +1. `spark-submit` automatically picks up the contents of `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +environment variables and sets the associated configuration parameters for `s3n` and `s3a`
    +to these values. This essentially propagates the values across the Spark cluster.
    +1. Authentication details may be manually added to the Spark configuration.
    +1. Alternatively, they can be programmatically added, as in the sketch below. *Important: never put
    +authentication secrets in source code. They will be compromised*.
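    +
    +As a sketch of the programmatic route, S3A credentials can be copied from environment variables
    +into the Hadoop configuration through the `spark.hadoop.` prefix, so that the secrets never appear
    +in source code:
    +
    +```scala
    +import org.apache.spark.SparkConf
    +
    +// Read the secrets from the environment rather than embedding them in the code.
    +val conf = new SparkConf()
    +  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY"))
    +  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_KEY"))
    +```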
    +
    +It is critical that the credentials used to access object stores are kept secret. Not only can
    +they be abused to run up compute charges, they can be used to read and alter private data.
    +
    +1. If adding login details to a Spark configuration file, do not share this file; that includes
    +attaching it to bug reports or committing it to SCM repositories.
    +1. Have different accounts for access to the storage for each application,
    +each with access rights restricted to those object storage buckets/containers used by the
    +application.
    +1. If the object store supports any form of session credential (e.g. Amazon's STS), issue
    +session credentials for the expected lifetime of the application.
    +1. When using a version of Spark with Hadoop 2.8+ libraries, consider using Hadoop
    +credential files to store secrets, referencing
    +these files in the relevant ID/secret properties of the XML configuration file.
    +
    +
    +## <a name="object_stores"></a>Object stores and Their Library Dependencies
    +
    +The different object stores supported by Spark depend on specific Hadoop versions,
    +and require specific Hadoop JARs and dependent Java libraries on the classpath.
    +
    +<table class="table">
    +  <tr><th>Schema</th><th>Store</th><th>Details</th></tr>
    +  <tr>
    +    <td><code>s3a://</code></td>
    +    <td>Amazon S3</a>
    +    <td>
    +    Recommended S3 client for Spark releases built on Apache Hadoop 2.7 or later.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3n://</code></td>
    +    <td>Amazon S3</a>
    +    <td>
    +    Deprected S3 client; only use for Spark releases built on Apache Hadoop 2.6 or earlier.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>s3://</code></td>
    +    <td>Amazon S3 on Amazon EMR</a>
    +    <td>
    +    Amazon's own S3 client; use only and exclusivley in Amazon EMR.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>wasb://</code></td>
    +    <td>Azure Storage</a>
    +    <td>
    +    Client for Microsoft Azure Storage; since Hadoop 2.7.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>swift://</code></td>
    +    <td>OpenStack Swift</a>
    +    <td>
    +    Client for OpenStack Swift object stores.
    +    </td>
    +  </tr>
    +  <tr>
    +    <td><code>gs://</code></td>
    +    <td>Google Cloud Storage</a>
    +    <td>
    +    Google's client for their cloud object store.
    +    </td>
    +  </tr>
    +</table>
    +
    +
    +### <a name="working_with_amazon_s3"></a>Working with Amazon S3
    +
    +Amazon's S3 object store is probably the most widely used object store; it is also the one
    +with the most client libraries. This is due to the evolution of Hadoop's support, and to Amazon
    +offering Hadoop and Spark as its EMR service, along with its own S3 client.
    +
    +The recommendations for which client to use depend upon the version of Hadoop on the Spark classpath.
    +
    +<table class="table">
    +  <tr><th>Hadoop Library Version</th><th>Client</th></tr>
    +  <tr>
    +    <td>Hadoop 2.7+ and commercial products based on it</a>
    +    <td><code>s3a://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Hadoop 2.6 or earlier</a>
    +    <td><code>s3n://</code></td>
    +  </tr>
    +  <tr>
    +    <td>Amazon EMR</a>
    +    <td><code>s3://</code></td>
    +  </tr>
    +</table>
    +
    +Authentication is generally via properties set in the Spark context or, in YARN clusters,
    +`core-site.xml`.
    +Recent versions of the S3A client also support short-lived session credentials and IAM authentication to
    +automatically pick up credentials on EC2 deployments. Consult the appropriate Hadoop documentation for specifics.
    +
    +`spark-submit` will automatically pick up and propagate `AWS_ACCESS_KEY` and `AWS_SECRET_KEY`
    +from the environment of the user running `spark-submit`; these
    +will override any values set in the configuration files.
    +
    +Be aware that while S3 buckets support complex access control declarations, Spark needs
    +full read/write access to any bucket to which it must write data. That is: it does not support writing
    +to buckets where the root paths are read-only, or not readable at all.
    +
    +#### <a name="s3a"></a>S3A Filesystem Client: `s3a://`
    +
    +The ["S3A" filesystem client](https://hadoop.apache.org/docs/stable2/hadoop-aws/tools/hadoop-aws/index.html)
    +first shipped in Hadoop 2.6, and has been considered ready for production use since Hadoop 2.7.1.
    +
    +*The S3A connector is the sole S3 connector undergoing active maintenance at the ASF, and
    +should be used wherever possible.*
    +
    +**Classpath**
    --- End diff --
    
    I'll tighten it all down, and cut the CP details.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Latest patch has updated the dependency settings. As noted, it works for Hadoop versions from 2.7 to 3.0.2-alpha & the HADOOP-13345 branch, at least if you build the last two with a `-Ddeclared.hadoop.version=2.11` to stop Hive overreacting to Hadoop major versions. Does the latter deliver a speedup? Not clear yet; I'm not doing the benchmarking, just the regression testing. Hive + Tez flies, and I'm sure Sean's colleagues can give the Impala numbers to him.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214753851
  
    Oh, and there's an initial documentation page on spark + cloud infrastructure, which tries to make clear that object stores are not real filesystems




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #64290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64290/consoleFull)** for PR 12004 at commit [`4601b0a`](https://github.com/apache/spark/commit/4601b0a38c6f794959d3760e457800909816fb5e).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r104287069
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,158 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.1.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project Cloud Integration</name>
    +  <description>
    +    Contains support for cloud infrastructures, specifically the Hadoop JARs and
    +    transitive dependencies needed to interact with the infrastructures.
    +
    +    Any project which explicitly depends upon the spark-cloud artifact will get the dependencies;
    +    the exact versions of which will depend upon the hadoop version Spark was compiled against.
    +
    +    Hadoop 2.7:
    +      hadoop-aws
    +      aws-java-sdk-s3
    +      hadoop-azure
    +      azure-storage
    +      hadoop-openstack
    +
    +    WARNING: the signatures of methods in aws-java-sdk/aws-java-sdk-s3 can change between versions:
    +    use the same version against which Hadoop was compiled.
    +
    +  </description>
    +  <properties>
    +    <sbt.project.name>cloud</sbt.project.name>
    +  </properties>
    +
    +  <dependencies>
    +    <dependency>
    +      <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-core_${scala.binary.version}</artifactId>
    +      <version>${project.version}</version>
    +    </dependency>
    +
    +    <!--Used for test classes -->
    +    <dependency>
    +      <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-core_${scala.binary.version}</artifactId>
    +      <version>${project.version}</version>
    +      <type>test-jar</type>
    +      <scope>test</scope>
    +    </dependency>
    +
    +
    +    <!-- Jets3t is needed for s3n and s3 classic to work-->
    +    <dependency>
    +      <groupId>net.java.dev.jets3t</groupId>
    +      <artifactId>jets3t</artifactId>
    +    </dependency>
    +
    +    <!-- Explicit listing of transitive deps that are shaded. Otherwise, odd compiler crashes. -->
    +    <dependency>
    +      <groupId>com.google.guava</groupId>
    +      <artifactId>guava</artifactId>
    +    </dependency>
    +    <!-- End of shaded deps. -->
    +  </dependencies>
    +
    +  <build>
    +    <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
    +    <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
    +  </build>
    +
    +  <profiles>
    +
    +    <!--
    +      This profile is enabled automatically by the sbt build. It changes the scope for the guava
    +      dependency, since we don't shade it in the artifacts generated by the sbt build.
    +    -->
    +    <profile>
    +      <id>sbt</id>
    +      <dependencies>
    +        <dependency>
    +          <groupId>com.google.guava</groupId>
    +          <artifactId>guava</artifactId>
    +          <scope>compile</scope>
    +        </dependency>
    +      </dependencies>
    +    </profile>
    +
    +    <profile>
    +      <id>hadoop-2.7</id>
    +        <dependencies>
    +          <dependency>
    +            <groupId>org.apache.hadoop</groupId>
    +            <artifactId>hadoop-aws</artifactId>
    --- End diff --
    
    Does any of this change now that we only support Hadoop 2.6+? I assume that's good news if anything. Only a `hadoop-2.7` profile is defined here so what would this do for 2.6? You mention that the salient packaging change occurred in 2.6.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #62255 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62255/consoleFull)** for PR 12004 at commit [`c7ba2aa`](https://github.com/apache/spark/commit/c7ba2aa8bd17ffda6eb7d17465a2d1e79705770e).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #62537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62537/consoleFull)** for PR 12004 at commit [`cb07c1d`](https://github.com/apache/spark/commit/cb07c1d7b79944059e477b0b615ce061b08cef00).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Thanks for elaborating on where this work will help @steveloughran. Again, just speaking from my own point of view as Spark user and [Flintrock](https://github.com/nchammas/flintrock) maintainer, this sounds like it would be a big help. I hope that after getting something like this in, we can have the default builds of Spark leverage it to bundle support for S3.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113697285
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, such as the
    +<code>hadoop-openstack</code> dependency; this can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, and
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows (a short example follows this list):
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
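    +
    +For example, the following sketch reads the public `s3a://landsat-pds/scene_list.gz` object and
    +writes a filtered copy back to the store; it assumes an existing `SparkContext` named `sc` and a
    +hypothetical output bucket.
    +
    +```scala
    +// Read a public object, keep the non-empty lines, and write them back out.
    +val sceneList = sc.textFile("s3a://landsat-pds/scene_list.gz")
    +val filtered = sceneList.filter(_.nonEmpty)
    +filtered.saveAsTextFile("s3a://my-output-bucket/landsat/filtered")
    +```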
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the input for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that storing temporary files can run up charges; Delete
    --- End diff --
    
    Yeah, is this a good idea? sooner or later these need to be cleaned up, and can it be that expensive? I just question whether to call this "recommended"




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #63588 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63588/consoleFull)** for PR 12004 at commit [`cb07c1d`](https://github.com/apache/spark/commit/cb07c1d7b79944059e477b0b615ce061b08cef00).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    If Hadoop 2.5 vs 2.6 behaves differently w.r.t. S3 support classes, we can vary dependencies within the existing profile even, sure. That should be fixed up. However I think we may be just about to drop 2.5 support anyway? That could simplify this.
    
    
    I get the idea of a small dependency-only module that includes a bunch of optional `hadoop-*` modules that contain support code that is specific to Hadoop + cloud-specific APIs. Are these integration libraries something users could supply in their app? maybe not. The module makes some sense then, so people can build in cloud-specific SDK support if they want.
    
    Docs are pretty uncontroversial, especially cloud-specific notes about config params and how to set them. That seems helpful.
    
    
    However I also see a lot of cloud-specific tests and examples. I didn't expect that. Is there new different functionality in Spark that only turns up in a cloud context? I see this is actually adding some new utility methods and new RDD API-like methods like saveAsTextFile(). I thought this would just be about making it easy to get the Hadoop API machinery set up to access cloud-specific storage.
    
    These tests couldn't be enabled on Spark Jenkins, right? at least, it would mean budgeting to run them and all that. If this is about making SDK integration easier, do we need specific tests? it seems to be more about testing the SDK and cloud service than anything, and prone to false positives.
    
    Not that it isn't useful, just trying to figure out how to reduce this to something less massive, at least to start.
    
    
    I don't know the origin of the feature branch comment -- is this referring to maintaining separate branches for major lines of development within Spark's primary git repo, and not just release branches? I actually don't quite like that. Downsides? Such branches are quasi-official when it's not clear they deserve that status more than others' collaborations. Enabling development off master for extended periods tends to let people do a lightweight fork and continue development without the forcing mechanism of getting review or buy in early. This leads to long-running dead ends, or "too big to not upmerge" feature branch battles. Or you get, well, forks. Wasn't this kind of how Hadoop ended up with a different "security" release branch a long time ago?
    
    The upside is collaborating on something that isn't master, but, git makes that trivial now. Yes the risk is that the collaboration is therefore not forcibly coupled to Spark's git repo, but we really only need that any such repo is public and open and shared on official channels. It's not like people can't collaborate, privately even, today, so not a new thing.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113976264
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,512 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
    +persistent data storage systems, "object stores". These are not quite the same as classic file
    +systems: in order to scale to hundreds of Petabytes, without any single points of failure
    +or size limits, object stores, "blobstores", have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These libraries make the object stores look *almost* like filesystems, with directories and
    +operations on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop libraries, such as the
    +<code>hadoop-openstack</code> dependency; this can
    +be done by including the `spark-hadoop-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-hadoop-cloud_2.10`.
    +
    +### Basic Use
    +
    +You can refer to data in an object store just as you would data in a filesystem, by
    +using a URL to the data in methods like `SparkContext.textFile()` to read data, and
    +`saveAsTextFile()` to write it back.
    +
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store.
    +In a YARN cluster, this may also be done in the `core-site.xml` file.
    +
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    +
    +As the examples show, you can write data to object stores. However, that does not mean
    +that they can be used as replacements for a cluster-wide filesystem.
    +
    +The full details are covered in [Cloud Object Stores are Not Real Filesystems](#cloud_stores_are_not_filesystems).
    +
    +The brief summary is:
    +
    +| Object Store Connector      |  Replace HDFS? |
    +|-----------------------------|--------------------|
    +| `s3a://` `s3n://`  from the ASF   | No  |
    +| Amazon EMR `s3://`          | Yes |
    +| Microsoft Azure `wasb://`   | Yes |
    +| OpenStack `swift://`        | No  |
    +
    +It is possible to use any of the object stores as a destination of work, i.e. use
    +`saveAsTextFile()` or `save()` to save data there, but the commit process may be slow
    +and unreliable in the presence of failures.
    +
    +It is faster and safer to use the cluster filesystem as the destination of Spark jobs,
    +using that data as the input for follow-on work. The final results can
    +be persisted into the object store using `distcp`.
    +
    +#### <a name="checkpointing"></a>Spark Streaming and object stores
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket through
    +`StreamingContext.textFileStream()`.
    +
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +1. Streams should only be checkpointed to an object store considered compatible with
    +HDFS. Otherwise the checkpointing will be slow and potentially unreliable.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are the settings to use when writing to object stores. This uses the "version 2" algorithm
    +for committing files, which does less renaming than the v1 algorithm. Speculative execution is
    +disabled to avoid multiple writers corrupting the output.
    +
    +```
    +spark.speculation false
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +There's also the option of skipping the cleanup of temporary files in the output directory.
    +Enabling this option eliminates a small delay caused by listing and deleting any such files.
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
    +```
    +
    +Bear in mind that leaving temporary files in the store can run up charges; delete
    +directories called `_temporary` on a regular basis to avoid this.
    +
    +
    +### YARN Scheduler settings
    +
    +When running Spark in a YARN cluster running in EC2, turning off locality avoids any delays
    --- End diff --
    
    It comes down to what the FS client returns in a call to `FS.getBlockLocations(path)`. S3a returns the default, "localhost", and what the driver decides is the best placement strategy for blocks which say that.
    
    Looking @ Azure, it supports a property where you can name a host to impersonate, again, default "localhost"; Swift can get some locality info from the store which it actually uses.
    
    So that's the real issue: what is the best way for a filesystem to let an app know the data is "in the cloud" and for the driver to recognise that and use it in its placement choice. In Spark it trickles into `RDD.preferredLocations()`; so really it's up to what happens there when the returned location for an RDD is "localhost". I haven't actually checked what goes on there; I'm recommending that option as it's generally what we do: either there's a good reason or it's superstition.
    
    How about I cut on the basis that "it's not a defensible statement" and add some JIRA to "work out what goes on here", maybe even doing some experiments?





[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #69480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69480/consoleFull)** for PR 12004 at commit [`e9b8ed0`](https://github.com/apache/spark/commit/e9b8ed0b47b300cfe6bb64d2d2622b842fede142).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #67991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67991/consoleFull)** for PR 12004 at commit [`c9f3a0b`](https://github.com/apache/spark/commit/c9f3a0bbdb682fc151233ae46abe97da382a9594).




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    **[Test build #69480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69480/consoleFull)** for PR 12004 at commit [`e9b8ed0`](https://github.com/apache/spark/commit/e9b8ed0b47b300cfe6bb64d2d2622b842fede142).
     * This patch **fails build dependency tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-216988393
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57782/
    Test PASSed.




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64580/
    Test FAILed.




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r113694301
  
    --- Diff: assembly/pom.xml ---
    @@ -226,5 +226,19 @@
             <parquet.deps.scope>provided</parquet.deps.scope>
           </properties>
         </profile>
    +
    +    <!--
    +     Pull in spark-hadoop-cloud and its associated JARs,
    +    -->
    +    <profile>
    +      <id>cloud</id>
    --- End diff --
    
    Call this `hadoop-cloud` perhaps?




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran closed the pull request at:

    https://github.com/apache/spark/pull/12004




[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    the latest patch moves to the suggested name `spark-hadoop-cloud`; the external test repo is in sync. Those tests are all working happily against S3 Ireland, Azure and Rackspace Swift, on Hadoop 2.7+.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-214747320
  
    **[Test build #56998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56998/consoleFull)** for PR 12004 at commit [`8926acb`](https://github.com/apache/spark/commit/8926acb25e05f5d0748155c26262aae0d54fb3d0).




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62260/
    Test PASSed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r57614456
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,141 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.0.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project cloud integration</name>
    +  <description>Contains support for cloud infrastructures, including the Hadoop JARs and
    +    transitive dependencies needed to interact with the infrastructures.
    +    When included in the spark-assembly JAR, the hadoop artifacts are included, but not
    +    any of the 3rd party libraries needed, such as those from Amazon (for AWS) and Microsoft (Azure).
    +    These will need to be explicitly added to the classpath of any application interacting
    +    with the services.
    +    
    +    Any project which explicitly depends upon the spark-cloud artifact will get the dependencies;
    +    the exact versions of which will depend upon the hadoop version Spark was compiled against.
    +
    +    Hadoop 2.5 and earlier: jets3t.
    +    Hadoop 2.6: hadoop-aws and aws-java-sdk JARs
    +    Hadoop 2.7+: hadoop-aws, aws-java-sdk-s3, hadoop-azure and azure-storage JARs
    +
    +    WARNING: the signatures of methods in aws-java-sdk/aws-java-sdk-s3 can change between versions:
    +    use the same version against which Hadoop was compiled.
    +  </description>
    +  <properties>
    +    <sbt.project.name>cloud</sbt.project.name>
    +  </properties>
    +
    +  <dependencies>
    +    <dependency>
    +      <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-core_${scala.binary.version}</artifactId>
    +      <version>${project.version}</version>
    +    </dependency>
    +    <dependency>
    +      <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-core_${scala.binary.version}</artifactId>
    +      <version>${project.version}</version>
    +      <type>test-jar</type>
    +      <scope>test</scope>
    +    </dependency>
    +    <dependency>
    +      <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-test-tags_${scala.binary.version}</artifactId>
    +    </dependency>
    +    <dependency>
    +      <groupId>net.java.dev.jets3t</groupId>
    +      <artifactId>jets3t</artifactId>
    +    </dependency>
    +
    +    <!-- Explicit listing of transitive deps that are shaded. Otherwise, odd compiler crashes. -->
    +    <dependency>
    +      <groupId>com.google.guava</groupId>
    +      <artifactId>guava</artifactId>
    +    </dependency>
    +    <!-- End of shaded deps. -->
    +  </dependencies>
    +  <build>
    +    <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
    +    <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
    +  </build>
    +
    +  <profiles>
    +    
    +    <!--
    +      This profile is enabled automatically by the sbt build. It changes the scope for the guava
    --- End diff --
    
    funny: that was a comment I lifted with the dependency cargo cult style; it'll need fixing in the original too.




[GitHub] spark pull request #12004: [SPARK-7481] [build] Add spark-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12004#discussion_r89352877
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,953 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## <a name="introduction"></a>Introduction
    +
    +
    +The public cloud infrastructures (Amazon AWS, Microsoft Azure, Google GCS and others) all offer
    +persistent data storage systems: "object stores". These are not quite the same as classic
    +filesystems: in order to scale to hundreds of petabytes without any single points of failure
    +or size limits, object stores ("blobstores") have a simpler model of `name => data`.
    +
    +Apache Spark can read and write data in object stores
    +through filesystem connectors implemented in Apache Hadoop or provided by third parties.
    +These connectors make the object stores look *almost* like filesystems, with operations
    +on files (rename) and directories (create, rename, delete) which mimic
    +those of a classic filesystem. Because of this, Spark and Spark-based applications
    +can work with object stores, generally treating them as if they were slower-but-larger filesystems.
    +
    +With these connectors, Apache Spark supports object stores as the source
    +of data for analysis, including Spark Streaming and DataFrames.
    +
    +
    +## <a name="quick_start"></a>Quick Start
    +
    +Provided the relevant libraries are on the classpath, and Spark is configured with your credentials,
    +objects in an object store can be read or written through URLs which use the name of the
    +object store client as the scheme and the bucket/container as the hostname.
    +
    +
    +### Dependencies
    +
    +The Spark application needs the relevant Hadoop object store libraries (such as
    +<code>hadoop-aws</code>, <code>hadoop-azure</code> or <code>hadoop-openstack</code>) on its classpath,
    +which can be done by including the `spark-cloud` module for the specific version of Spark used.
    +For example, for Maven support, add the following to the <code>pom.xml</code> file:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +If using the Scala 2.10-compatible version of Spark, the artifact is of course `spark-cloud_2.10`.
    +
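    +For an sbt build, a roughly equivalent declaration would be the following sketch
    +(assuming the module is published under these coordinates; adjust the version to match
    +the Spark release in use):
    +
    +{% highlight scala %}
    +// hypothetical sbt dependency on the cloud module
    +val sparkVersion = "2.0.0"
    +libraryDependencies += "org.apache.spark" %% "spark-cloud" % sparkVersion
    +{% endhighlight %}
    +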
    +### Basic Use
    +
    +
    +
    +To refer to a path in Amazon S3, use `s3a://` as the scheme (Hadoop 2.7+) or `s3n://` on older versions.
    +
    +{% highlight scala %}
    +sparkContext.textFile("s3a://landsat-pds/scene_list.gz").count()
    +{% endhighlight %}
    +
    +Similarly, an RDD can be saved to an object store via `saveAsTextFile()`
    +
    +
    +{% highlight scala %}
    +val numbers = sparkContext.parallelize(1 to 1000)
    +
    +// save to Amazon S3 (or compatible implementation)
    +numbers.saveAsTextFile("s3a://testbucket/counts")
    +
    +// Save to Azure Object store
    +numbers.saveAsTextFile("wasb://testbucket@example.blob.core.windows.net/counts")
    +
    +// save to an OpenStack Swift implementation
    +numbers.saveAsTextFile("swift://testbucket.openstack1/counts")
    +{% endhighlight %}
    +
    +That's essentially it: object stores can act as a source and destination of data, using exactly
    +the same APIs to load and save data as one uses to work with data in HDFS or other filesystems.
    +
    +Because object stores are viewed by Spark as filesystems, object stores can
    +be used as the source or destination of any Spark work, be it batch, SQL, DataFrame,
    +Streaming or something else.
    +
    +The steps to do so are as follows:
    +
    +1. Use the full URI to refer to a bucket, including the prefix for the client-side library
    +to use. Example: `s3a://landsat-pds/scene_list.gz`
    +1. Have the Spark context configured with the authentication details of the object store
    +(see the sketch below). In a YARN cluster, this may also be done in the `core-site.xml` file.
    +1. Have the JAR containing the filesystem classes on the classpath, along with all of its dependencies.
    +
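    +For the second step, a minimal sketch of supplying S3A credentials through the Spark
    +configuration (the key and secret values are placeholders; `spark.hadoop.`-prefixed
    +options are copied into the Hadoop configuration):
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +
    +val sparkConf = new SparkConf()
    +  .set("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    +  .set("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    +{% endhighlight %}
    +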
    +### <a name="dataframes"></a>Example: DataFrames
    +
    +DataFrames can be created from and saved to object stores through the `read()` and `write()` methods.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.types.StringType
    +
    +val spark = SparkSession
    +    .builder
    +    .appName("DataFrames")
    +    .config(sparkConf)
    +    .getOrCreate()
    +import spark.implicits._
    +val numRows = 1000
    +
    +// generate test data
    +val sourceData = spark.range(0, numRows).select($"id".as("l"), $"id".cast(StringType).as("s"))
    +
    +// define the destination
    +val dest = "wasb://yourcontainer@youraccount.blob.core.windows.net/dataframes"
    +
    +// write the data
    +val orcFile = dest + "/data.orc"
    +sourceData.write.format("orc").save(orcFile)
    +
    +// now read it back
    +val orcData = spark.read.format("orc").load(orcFile)
    +
    +// finally, write the data as Parquet
    +orcData.write.format("parquet").save(dest + "/data.parquet")
    +spark.stop()
    +{% endhighlight %}
    +
    +### <a name="streaming"></a>Example: Spark Streaming and Cloud Storage
    +
    +Spark Streaming can monitor files added to object stores, by
    +creating a `FileInputDStream` DStream monitoring a path under a bucket.
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.streaming._
    +
    +val sparkConf = new SparkConf()
    +val ssc = new StreamingContext(sparkConf, Milliseconds(5000))
    +try {
    +  val lines = ssc.textFileStream("s3a://bucket/incoming")
    +  val matches = lines.filter(_.endsWith("3"))
    +  matches.print()
    +  ssc.start()
    +  ssc.awaitTermination()
    +} finally {
    +  ssc.stop(true)
    +}
    +{% endhighlight %}
    +
    +1. The time to scan for new files is proportional to the number of files
    +under the path, not the number of *new* files, so it can become a slow operation.
    +The size of the window needs to be set to handle this.
    +
    +1. Files only appear in an object store once they are completely written; there
    +is no need for a workflow of write-then-rename to ensure that files aren't picked up
    +while they are still being written. Applications can write straight to the monitored directory.
    +
    +#### <a name="checkpointing"></a>Checkpointing Streams to object stores
    +
    +Streams should only be checkpointed to an object store considered compatible with
    +HDFS. As the checkpoint operation includes a `rename()` operation, checkpointing to
    +an object store can be so slow that streaming throughput collapses.
    +
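    +A minimal sketch, reusing the `ssc` context from the streaming example above and
    +checkpointing to an HDFS path instead (the URI is a placeholder):
    +
    +{% highlight scala %}
    +// checkpoint to a real filesystem rather than an object store
    +ssc.checkpoint("hdfs://namenode:8020/checkpoints/streaming-app")
    +{% endhighlight %}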
    +
    +## <a name="output"></a>Object Stores as a substitute for HDFS
    --- End diff --
    
    I should add that you ought to point this out [to your doc team](https://www.cloudera.com/documentation/enterprise/5-8-x/topics/spark_s3.html), especially the bit about speculation. Our docs are [based on those in this PR](http://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-spark/index.html), including all the warnings. S3 works great as a source of data; the S3A phase II work benefits the column formats (ORC, Parquet) a lot, with other tuning coming along. It's the rename-in-commit which is the enemy.
    
    Eventual consistency? Not much of an issue for static/infrequently updated data, though it does surface in tests.
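
    (For reference, a minimal sketch of turning speculative execution off for jobs writing to
    object stores; `spark.speculation` is the relevant Spark property, everything else here is
    illustrative:)

        val conf = new org.apache.spark.SparkConf().set("spark.speculation", "false")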




[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.6+ spark-cloud mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12004
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62255/
    Test FAILed.




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ profi...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-203424016
  
    build failing, as the SBT build needs to make the spark-cloud module conditional on Hadoop 2.6+
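
    (A rough sketch of the kind of gating needed in the SBT build; the names and the
    hadoop.version property handling here are illustrative, not the actual SparkBuild code:)

        // include the cloud project only when building against Hadoop 2.6 or later
        val hadoopVersion = sys.props.getOrElse("hadoop.version", "2.2.0")
        val cloudModuleEnabled = hadoopVersion.split("\\.").take(2).map(_.toInt) match {
          case Array(major, minor) => major > 2 || (major == 2 && minor >= 6)
          case _ => false
        }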




[GitHub] spark pull request: [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12004#issuecomment-211545950
  
    **[Test build #56092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56092/consoleFull)** for PR 12004 at commit [`ce48b8c`](https://github.com/apache/spark/commit/ce48b8c377446e210c552d70ce4a391f4a4db4e6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.

