Posted to reviews@spark.apache.org by steveloughran <gi...@git.apache.org> on 2017/05/02 18:33:18 UTC

[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/17834

    [SPARK-7481] [build] Add spark-hadoop-cloud module to pull in object store access.

    ## What changes were proposed in this pull request?
    
    Add a new `spark-hadoop-cloud` module and Maven profile to pull in object store support from the `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependency versions so that everything works together, in particular Jackson.
    
    It restores `s3n://` access to S3, and adds its `s3a://` replacement, OpenStack `swift://` and Azure `wasb://`.
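
    For illustration only (not part of the patch): with the module on the classpath and credentials configured, a store is addressed purely by its URL scheme. A minimal Scala sketch, in which the bucket, container and account names are hypothetical and `spark` is an existing `SparkSession`:

        // Read and write through the object store connectors; any of the
        // schemes above works once its connector and credentials are set up.
        val sc = spark.sparkContext
        val logs = sc.textFile("s3a://my-bucket/logs/2017/05/*.gz")            // Amazon S3 via s3a
        val events = spark.read.json("wasb://container@account.blob.core.windows.net/events/")  // Azure
        events.write.parquet("s3a://my-bucket/output/events-parquet")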
    
    There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem.
    In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
    
    (this is the successor to #12004; I can't re-open it)
    
    ## How was this patch tested?
    
    Tests are in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples).
    
    Those verify that the dependencies are sufficient to allow downstream applications to work with the s3a, Azure wasb and swift storage connectors, and to perform basic IO & DataFrame operations against them. All seems well.
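
    For context, the kind of round trip those suites exercise looks roughly like the following sketch (the bucket path is hypothetical and this is not the actual test code):

        // Write a small DataFrame to the store, read it back and check the result.
        val dest = "s3a://test-bucket/spark-cloud/numbers"
        spark.range(1000).toDF("i").write.mode("overwrite").parquet(dest)
        val evens = spark.read.parquet(dest).filter("i % 2 = 0").count()
        assert(evens == 500)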
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-7481-current

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17834.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17834
    
----
commit 1da9a3d181e5226a0ae9379c0c8905b319a4afe9
Author: Steve Loughran <st...@apache.org>
Date:   2016-11-18T15:50:15Z

    [SPARK-7481] stripped down packaging only module

commit 028d9ed428638520239da7d2b619d20817df56fd
Author: Steve Loughran <st...@apache.org>
Date:   2016-11-18T17:02:53Z

    [SPARK-7481] basic instantiation tests verify that the hadoop-azure, hadoop-aws and hadoop-openstack dependencies, and implicitly their transitive dependencies, are resolved. They don't verify all dependency setup, specifically that Jackson versions are consistent; that needs integration testing.

commit ace46e98e913ee68c0aca88d17eeb0f055da074b
Author: Steve Loughran <st...@apache.org>
Date:   2016-11-18T19:04:53Z

    [SPARK-7481] tests restricted to instantiation; logging modified appropriately

commit 3f6dfdad893d083e4653c547fcd6406a91dd9544
Author: Steve Loughran <st...@apache.org>
Date:   2016-11-21T12:07:25Z

    [SPARK-7481] declare httpcomponents:httpclient explicitly, as downstream tests which pulled in spark-cloud but not spark-hive were ending up with inconsistent versions. Add a test for the missing class being there too.

commit 5f8f996cea76a16391073b46023a981cba3b3cce
Author: Steve Loughran <st...@apache.org>
Date:   2016-11-21T17:56:05Z

    [SPARK-7481] update docs by culling the section on cloud integration tests; link to remaining docs from top level.

commit e92a49322dfdb777e996e9b07b298bb8ae8967d6
Author: Steve Loughran <st...@apache.org>
Date:   2016-11-28T15:44:10Z

    [SPARK-7481]  updated documentation as per review

commit 97e80e1963b8f64905165c08974197cf4cd68356
Author: Steve Loughran <st...@apache.org>
Date:   2016-11-28T15:44:30Z

    [SPARK-7481]  SBT will build this now, optionally

commit ef3cebfd1baf928c3f30380f662eaee13ee6ca08
Author: Steve Loughran <st...@apache.org>
Date:   2016-11-28T15:45:44Z

    [SPARK-7481] cloud POM includes jackson-dataformat-cbor, so that the CP is set up consistently for the later versions of the AWS SDK

commit 66650c7c7d4d9e2cb640175428bf16a343d6319b
Author: Steve Loughran <st...@apache.org>
Date:   2016-12-01T13:30:48Z

    [SPARK-7481]  rebase with master; Pom had got out of sync

commit 31cc37e90f2dcb0ebbe696bc08d951e0526293f9
Author: Steve Loughran <st...@apache.org>
Date:   2016-12-02T17:39:52Z

    [SPARK-7481] rename spark-cloud module to spark-hadoop-cloud, in POMs and docs

commit 2fc6f23b5397f344583c0e192f88fb40bb88f6ad
Author: Steve Loughran <st...@apache.org>
Date:   2016-12-14T15:47:10Z

    [SPARK-7481] bump up cloud pom to 2.2.0-SNAPSHOT; other minor pom cleanup

commit 65f6814ccba464dbba1c8a5390638291c7c3cf1a
Author: Steve Loughran <st...@apache.org>
Date:   2017-01-10T14:07:18Z

    [SPARK-7481] builds against Hadoop shaded 3.x clients were failing, as direct references to AWS classes fail to link. Cut them and rely on transitive load through FS class instantiation to force the load. All that happens is that link failures will be slightly less easy to debug.

commit 73820a341cbbdecdd386a1448300439577273671
Author: Steve Loughran <st...@apache.org>
Date:   2017-01-20T13:52:45Z

    [SPARK-7481] update 2.7 dependencies to include azure, aws and openstack JARs, transitive dependencies on aws and azure SDKs

commit 824d801d43000161533dd50c9e2c7d2f1a1f7a0b
Author: Steve Loughran <st...@apache.org>
Date:   2017-01-30T14:27:39Z

    [SPARK-7481] add joda time as the dependency. Tested against hadoop branch-2, s3 ireland

commit 12a1b8488968917e4d99a39c7dd3ac2d39f87727
Author: Steve Loughran <st...@apache.org>
Date:   2017-02-24T14:30:29Z

    SPARK-7481 purge all tests from the cloud module

commit a7a2deca3cf00488682e355b41c716ecce57a62f
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-03-20T14:10:12Z

    SPARK-7481 add cloud module to sbt sequence
    
    Change-Id: I3dea2544f089615493163f0fae482992873f9c35

commit 02f6e19bef8d7e1e0622d04bf47bb2c785996877
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-03-20T14:14:37Z

    SPARK-7481 break line of mvn XML declaration
    
    Change-Id: Ibd6d40df2bc8a2edf19a058c458bea233ba414fd

commit ce042d2405706bc7cd6b0d2a410c36346be0c86e
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-03-20T19:19:49Z

    SPARK-7481 cloud pom is still JAR (not pom). works against Hadoop 2.6 as well as 2.7, keeping azure the 2.7.x dependency. All dependencies are scoped @ hadoop.scope
    
    Change-Id: I80bd95fd48e21cf2eb4d94907ac99081cd3bd375

commit a98575370d9af1cda2c8b05672beea101ec6e83e
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-04-27T15:07:10Z

    SPARK-7481 move to Spark 2.3.0-SNAPSHOT
    
    Change-Id: I91f764aeed7d832df1538453d869a7fd83964d65

commit 0e0527d62295b1d18a53ab12ac12fddaddf7be94
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-04-27T20:18:06Z

    tweaked pom; updated docs
    
    Change-Id: I12ea6ed72ffa9edee964c90c862ff4c45bc4f47f

commit b78158f7aaeaebda206c30ea3e620b3775b3481b
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-04-28T14:50:58Z

    SPARK-7481 strip down the docs to a bare minimum: FS differences, security, spark-specific options + links elsewhere
    
    Change-Id: I7e9efe20d116802a403af875b241b91178078d78

commit de3e95bfaa012fe8003d030fe84b00259d7610aa
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-04-28T15:52:06Z

    SPARK-7481 doc review
    
    Change-Id: I1923a4b6a959d86aa2c5b3d71faaaf2541d3ba85

commit 9b1579b04646e8581482d2b37e8b3d984be7dd75
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-04-28T17:26:10Z

    review comments
    
    Change-Id: I6a0b0b9f06a4adcdf55ef75161dc1039961bc7a1

commit 844e2551daad0ecfd1f870c4d3e130e361c454c1
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-05-02T13:44:10Z

    SPARK-7481 more proofreading
    
    Change-Id: Ic4804667af8e52b7be11fb00621ad8b69a1d2569

commit 72a03ed58331813b0ad4bc9517fcc1f23a5eda6f
Author: Steve Loughran <st...@hortonworks.com>
Date:   2017-05-02T18:21:46Z

    SPARK-7481 proofreading docs
    
    Change-Id: I2b75a2722f0082b916b9be20bd23a0bdc2d36615

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    The last one was on all the doc comments, and I believe I've addressed them, both by fixing the little typos and by focusing the docs on the main points for Spark users: how stores differ from filesystems, and what that means.
    
    The big issue for Spark users is "the commit problem", where I've listed the object store behaviours and said "this means things may not work; consult the docs". I'm not being explicit about what works and doesn't work, as that's a moving target.
    
    Right now, I don't trust commits via S3A using the V1 or V2 FileOutputCommitter algorithms to work 100% of the time, because they rely on listing consistency, which Amazon S3 doesn't guarantee. I could make that a lot clearer, something like:
    
    *You cannot reliably use the FileOutputCommitter to commit work to Amazon S3 or OpenStack Swift unless there is some form of consistency layer on top*.
    
    That probably is the core concept people need to know: it's not safe without something (EMR, S3Guard, Databricks Commit Service) to give you that consistent view.
    
    Then add pointers to the talks by myself and [Eric Liang](https://www.slideshare.net/databricks/robust-and-scalable-etl-over-cloud-storage-with-apache-spark) on the topic.
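
    (As a concrete illustration of the settings the doc already lists, they can also be applied programmatically; a minimal Scala sketch, with the same caveat that without a consistency layer this reduces rather than removes the risk:)

        import org.apache.spark.sql.SparkSession

        // Apply the doc's recommended output-committer settings up front.
        // Still not a substitute for a consistency layer when committing to S3.
        val spark = SparkSession.builder()
          .appName("cloud-output-example")
          .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
          .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
          .getOrCreate()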
    





[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114975188
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,203 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## Introduction
    +
    +
    +All major cloud providers offer persistent data storage in *object stores*.
    +These are not classic "POSIX" file systems.
    +In order to store hundreds of petabytes of data without any single points of failure,
    +object stores replace the classic filesystem directory tree
    +with a simpler model of `object-name => data`. To enable remote access, operations
    +on objects are usually offered as (slow) HTTP REST operations.
    +
    +Spark can read and write data in object stores through filesystem connectors implemented
    +in Hadoop or provided by the infrastructure suppliers themselves.
    +These connectors make the object stores look *almost* like filesystems, with directories and files
    +and the classic operations on them such as list, delete and rename.
    +
    +
    +### Important: Cloud Object Stores are Not Real Filesystems
    +
    +While the stores appear to be filesystems, underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
    --- End diff --
    
    Now that we're down to tiny nits, I'd link everywhere to https:// URLs




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17834




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114652036
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,190 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## Introduction
    +
    +
    +All major cloud providers offer persistent data storage in *object stores*.
    +These are not classic "POSIX" file systems.
    +In order to store hundreds of petabytes of data without any single points of failure,
    +object stores replace the classic filesystem directory tree
    +with a simpler model of `object-name => data`. To enable remote access, operations
    +on objects are usually offered as (slow) HTTP REST operations.
    +
    +Spark can read and write data in object stores through filesystem connectors implemented
    +in Hadoop or provided by the infrastructure suppliers themselves.
    +These connectors make the object stores look *almost* like filesystems, with directories and files
    +and the classic operations on them such as list, delete and rename.
    +
    +
    +### Important: Cloud Object Stores are Not Real Filesystems
    +
    +While the stores appear to be filesystems, underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
    +
    +They cannot be used as a direct replacement for a cluster filesystem such as HDFS
    +*except where this is explicitly stated*.
    +
    +Key differences are
    +
    +* Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
    +* The means by which directories are emulated may make working with them slow.
    +* Rename operations may be very slow and, on failure, leave the store in an unknown state.
    +* Seeking within a file may require new HTTP calls, hurting performance. 
    +
    +How does affect Spark? 
    +
    +1. Reading and writing data can be significantly slower than working with a normal filesystem.
    +1. Some directory structures may be very inefficient to scan during query split calculation.
    +1. The output of work may not be immediately visible to a follow-on query.
    +1. The rename-based algorithm by which Spark normally commits work when saving an RDD, DataFrame or Dataset
    + is potentially both slow and unreliable.
    +
    +For these reasons, it is not always safe to use an object store as a direct destination of queries, or as
    +an intermediate store in a chain of queries. Consult the documentation of the object store and its
    +connector to determine which uses are considered safe.
    +
    +### Installation
    +
    +With the relevant libraries on the classpath and Spark configured with valid credentials,
    +objects can be can be read or written by using their URLs as the path to data.
    +For example `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create
    +an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.
    +
    +To add the relevant libraries to an application's classpath, include the `spark-hadoop-cloud` 
    +module and its dependencies.
    +
    +In Maven, add the following to the `pom.xml` file, assuming `spark.version`
    +is set to the chosen version of Spark:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +Commercial products based on Apache Spark generally directly set up the classpath
    +for talking to cloud infrastructures, in which case this module may not be needed.
    +
    +### Authenticating
    +
    +Spark jobs must authenticate with the object stores to access data within them.
    +
    +1. When Spark is running in a cloud infrastructure, the credentials are usually automatically set up.
    +1. `spark-submit` reads the `AWS_ACCESS_KEY`, `AWS_SECRET_KEY`
    +and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options
    +for the `s3n` and `s3a` connectors to Amazon S3.
    +1. In a Hadoop cluster, settings may be set in the `core-site.xml` file.
    +1. Authentication details may be manually added to the Spark configuration in `spark-default.conf`
    +1. Alternatively, they can be programmatically set in the `SparkConf` instance used to configure 
    +the application's `SparkContext`.
    +
    +*Important: never check authentication secrets into source code repositories,
    +especially public ones*
    +
    +Consult [the Hadoop documentation](http://hadoop.apache.org/docs/current/) for the relevant
    +configuration and security options.
    +
    +## Configuring
    +
    +Each cloud connector has its own set of configuration parameters, again, 
    +consult the relevant documentation.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are some settings to use when writing to object stores. 
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +This uses the "version 2" algorithm for committing files, which does less
    +renaming than the "version 1" algorithm, though as it still uses `rename()`
    +to commit files, it may be unsafe to use.
    --- End diff --
    
    I'll try to think of a better phrasing, saying "if your object store is consistent enough use v2 for speed".




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114644953
  
    --- Diff: pom.xml ---
    @@ -1145,6 +1150,70 @@
               </exclusion>
             </exclusions>
           </dependency>
    +      <!--
    +        the AWS module pulls in jackson; its transitive dependencies can create
    +        intra-jackson-module version problems.
    +        -->
    +      <dependency>
    --- End diff --
    
    OK, I'll do that. It's easily enough revisited in future.
    
    The one thing that will probably need revisiting is the jackson dependency, as that's got the highest risk that someone will add it to another module. Though as it is driven by the same jackson.version variable as everywhere else, a duplicate declaration elsewhere isn't going to cause any harm.
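
    (A quick diagnostic for the intra-Jackson version skew being described, runnable from a Spark shell on the assembled classpath; just a sketch, not part of the patch:)

        // Print the jackson-core and jackson-databind versions actually loaded;
        // a mismatch here is the symptom this dependency pinning guards against.
        import com.fasterxml.jackson.core.json.{PackageVersion => CorePackageVersion}
        import com.fasterxml.jackson.databind.ObjectMapper

        println(s"jackson-core:     ${CorePackageVersion.VERSION}")
        println(s"jackson-databind: ${new ObjectMapper().version()}")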





[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114975135
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,203 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## Introduction
    +
    +
    +All major cloud providers offer persistent data storage in *object stores*.
    +These are not classic "POSIX" file systems.
    +In order to store hundreds of petabytes of data without any single points of failure,
    +object stores replace the classic filesystem directory tree
    +with a simpler model of `object-name => data`. To enable remote access, operations
    +on objects are usually offered as (slow) HTTP REST operations.
    +
    +Spark can read and write data in object stores through filesystem connectors implemented
    +in Hadoop or provided by the infrastructure suppliers themselves.
    +These connectors make the object stores look *almost* like filesystems, with directories and files
    +and the classic operations on them such as list, delete and rename.
    +
    +
    +### Important: Cloud Object Stores are Not Real Filesystems
    +
    +While the stores appear to be filesystems, underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
    +
    +They cannot be used as a direct replacement for a cluster filesystem such as HDFS
    +*except where this is explicitly stated*.
    +
    +Key differences are
    --- End diff --
    
    Nit: I'd end the line with a colon to make it clear it's not dangling




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114536487
  
    --- Diff: assembly/pom.xml ---
    @@ -226,5 +226,19 @@
             <parquet.deps.scope>provided</parquet.deps.scope>
           </properties>
         </profile>
    +
    +    <!--
    +     Pull in spark-hadoop-cloud and its associated JARs,
    +    -->
    +    <profile>
    +      <id>cloud</id>
    --- End diff --
    
    Continuing from https://github.com/apache/spark/pull/12004#discussion_r113694301 : I still think this should be `hadoop-cloud` just like the artifact name is `spark-hadoop-cloud`. The profile name itself doesn't cause an artifact to be generated called `hadoop-cloud`.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Merged to master




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76492/
    Test PASSed.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    That's because you also changed the artifact name to `hadoop-cloud-...`. That should remain `spark-hadoop-cloud-...`. It's the directory, module name, and profile ID that should be `hadoop-cloud`. It is at least consistent with how other modules and profiles are named.
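
    (For a downstream sbt user the distinction plays out like this; a sketch assuming the artifact is published under these coordinates, with the version purely illustrative:)

        // sbt: the coordinate is the artifact name spark-hadoop-cloud,
        // not the hadoop-cloud profile/module/directory name.
        libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "2.3.0-SNAPSHOT"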




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    OK, but does this address the comments from the other PR? It didn't look like it (yet) from a quick look.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76426/
    Test PASSed.




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114551400
  
    --- Diff: pom.xml ---
    @@ -1145,6 +1150,70 @@
               </exclusion>
             </exclusions>
           </dependency>
    +      <!--
    +        the AWS module pulls in jackson; its transitive dependencies can create
    +        intra-jackson-module version problems.
    +        -->
    +      <dependency>
    --- End diff --
    
    Continuing https://github.com/apache/spark/pull/12004#discussion_r113700119 - I think it's useful to put things in the parent when they're used across multiple modules. For something only referenced in one child POM, it is probably less overhead to just put the dependency in the child rather than split and repeat part of it.




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114975089
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,203 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## Introduction
    +
    +
    +All major cloud providers offer persistent data storage in *object stores*.
    +These are not classic "POSIX" file systems.
    +In order to store hundreds of petabytes of data without any single points of failure,
    +object stores replace the classic filesystem directory tree
    +with a simpler model of `object-name => data`. To enable remote access, operations
    +on objects are usually offered as (slow) HTTP REST operations.
    +
    +Spark can read and write data in object stores through filesystem connectors implemented
    +in Hadoop or provided by the infrastructure suppliers themselves.
    +These connectors make the object stores look *almost* like filesystems, with directories and files
    +and the classic operations on them such as list, delete and rename.
    +
    +
    +### Important: Cloud Object Stores are Not Real Filesystems
    +
    +While the stores appear to be filesystems, underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
    +
    +They cannot be used as a direct replacement for a cluster filesystem such as HDFS
    +*except where this is explicitly stated*.
    +
    +Key differences are
    +
    +* Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
    +* The means by which directories are emulated may make working with them slow.
    +* Rename operations may be very slow and, on failure, leave the store in an unknown state.
    +* Seeking within a file may require new HTTP calls, hurting performance. 
    +
    +How does affect Spark? 
    --- End diff --
    
    Just noticed a small typo: How does this affect Spark?




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114982436
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,203 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## Introduction
    +
    +
    +All major cloud providers offer persistent data storage in *object stores*.
    +These are not classic "POSIX" file systems.
    +In order to store hundreds of petabytes of data without any single points of failure,
    +object stores replace the classic filesystem directory tree
    +with a simpler model of `object-name => data`. To enable remote access, operations
    +on objects are usually offered as (slow) HTTP REST operations.
    +
    +Spark can read and write data in object stores through filesystem connectors implemented
    +in Hadoop or provided by the infrastructure suppliers themselves.
    +These connectors make the object stores look *almost* like filesystems, with directories and files
    +and the classic operations on them such as list, delete and rename.
    +
    +
    +### Important: Cloud Object Stores are Not Real Filesystems
    +
    +While the stores appear to be filesystems, underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
    +
    +They cannot be used as a direct replacement for a cluster filesystem such as HDFS
    +*except where this is explicitly stated*.
    +
    +Key differences are
    +
    +* Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
    +* The means by which directories are emulated may make working with them slow.
    +* Rename operations may be very slow and, on failure, leave the store in an unknown state.
    +* Seeking within a file may require new HTTP calls, hurting performance. 
    +
    +How does affect Spark? 
    --- End diff --
    
    fixed, "how does this affect Spark?"




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114646578
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,106 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    --- End diff --
    
    OK




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114556308
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,190 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## Introduction
    +
    +
    +All major cloud providers offer persistent data storage in *object stores*.
    +These are not classic "POSIX" file systems.
    +In order to store hundreds of petabytes of data without any single points of failure,
    +object stores replace the classic filesystem directory tree
    +with a simpler model of `object-name => data`. To enable remote access, operations
    +on objects are usually offered as (slow) HTTP REST operations.
    +
    +Spark can read and write data in object stores through filesystem connectors implemented
    +in Hadoop or provided by the infrastructure suppliers themselves.
    +These connectors make the object stores look *almost* like filesystems, with directories and files
    +and the classic operations on them such as list, delete and rename.
    +
    +
    +### Important: Cloud Object Stores are Not Real Filesystems
    +
    +While the stores appear to be filesystems, underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
    +
    +They cannot be used as a direct replacement for a cluster filesystem such as HDFS
    +*except where this is explicitly stated*.
    +
    +Key differences are
    +
    +* Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
    +* The means by which directories are emulated may make working with them slow.
    +* Rename operations may be very slow and, on failure, leave the store in an unknown state.
    +* Seeking within a file may require new HTTP calls, hurting performance. 
    +
    +How does affect Spark? 
    +
    +1. Reading and writing data can be significantly slower than working with a normal filesystem.
    +1. Some directory structures may be very inefficient to scan during query split calculation.
    +1. The output of work may not be immediately visible to a follow-on query.
    +1. The rename-based algorithm by which Spark normally commits work when saving an RDD, DataFrame or Dataset
    + is potentially both slow and unreliable.
    +
    +For these reasons, it is not always safe to use an object store as a direct destination of queries, or as
    +an intermediate store in a chain of queries. Consult the documentation of the object store and its
    +connector to determine which uses are considered safe.
    +
    +### Installation
    +
    +With the relevant libraries on the classpath and Spark configured with valid credentials,
    +objects can be can be read or written by using their URLs as the path to data.
    +For example `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create
    +an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.
    +
    +To add the relevant libraries to an application's classpath, include the `spark-hadoop-cloud` 
    +module and its dependencies.
    +
    +In Maven, add the following to the `pom.xml` file, assuming `spark.version`
    +is set to the chosen version of Spark:
    +
    +{% highlight xml %}
    +<dependencyManagement>
    +  ...
    +  <dependency>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +    <version>${spark.version}</version>
    +  </dependency>
    +  ...
    +</dependencyManagement>
    +{% endhighlight %}
    +
    +Commercial products based on Apache Spark generally directly set up the classpath
    +for talking to cloud infrastructures, in which case this module may not be needed.
    +
    +### Authenticating
    +
    +Spark jobs must authenticate with the object stores to access data within them.
    +
    +1. When Spark is running in a cloud infrastructure, the credentials are usually automatically set up.
    +1. `spark-submit` reads the `AWS_ACCESS_KEY`, `AWS_SECRET_KEY`
    +and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options
    +for the `s3n` and `s3a` connectors to Amazon S3.
    +1. In a Hadoop cluster, settings may be set in the `core-site.xml` file.
    +1. Authentication details may be manually added to the Spark configuration in `spark-default.conf`
    +1. Alternatively, they can be programmatically set in the `SparkConf` instance used to configure 
    +the application's `SparkContext`.
    +
    +*Important: never check authentication secrets into source code repositories,
    +especially public ones*
    +
    +Consult [the Hadoop documentation](http://hadoop.apache.org/docs/current/) for the relevant
    +configuration and security options.
    +
    +## Configuring
    +
    +Each cloud connector has its own set of configuration parameters, again, 
    +consult the relevant documentation.
    +
    +### Recommended settings for writing to object stores
    +
    +Here are some settings to use when writing to object stores. 
    +
    +```
    +spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
    +spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
    +```
    +
    +This uses the "version 2" algorithm for committing files, which does less
    +renaming than the "version 1" algorithm, though as it still uses `rename()`
    +to commit files, it may be unsafe to use.
    --- End diff --
    
    It's hard to give advice here, as we discussed. It's possible to link to external documentation. Here, however, it sounds like you're saying version 2 is recommended, then immediately saying it could be unsafe to use. What does the reader make of that? What's the upside, and what's the nature and likelihood of failure?
    
    If there's no clear answer, is this really a reliable recommendation?
    Should this perhaps call out the issue and link to a larger discussion instead?




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76390/
    Test PASSed.




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114551666
  
    --- Diff: cloud/pom.xml ---
    @@ -0,0 +1,106 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    --- End diff --
    
    Oh, just noticed this -- should this perhaps be in a directory `hadoop-cloud` to match the profile and module name?




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    **[Test build #76492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76492/testReport)** for PR 17834 at commit [`32ebc8c`](https://github.com/apache/spark/commit/32ebc8cd15cd3705279fbeee1b9527abd903023d).




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    **[Test build #76457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76457/testReport)** for PR 17834 at commit [`e173e3f`](https://github.com/apache/spark/commit/e173e3f2a60a8ecc9875dbda24beba793d86d019).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76457/
    Test PASSed.




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114982772
  
    --- Diff: hadoop-cloud/pom.xml ---
    @@ -0,0 +1,185 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0"
    +  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    +  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.3.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project Cloud Integration</name>
    --- End diff --
    
    As it works just as well standalone and on Mesos, I'll try "through Hadoop Libraries" and see how that reads.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    **[Test build #76426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76426/testReport)** for PR 17834 at commit [`b788494`](https://github.com/apache/spark/commit/b788494cb63c91814309cdf22b55d3301292ac66).




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    OK, now I understand. Let me revert that bit of the patch.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    **[Test build #76492 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76492/testReport)** for PR 17834 at commit [`32ebc8c`](https://github.com/apache/spark/commit/32ebc8cd15cd3705279fbeee1b9527abd903023d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    **[Test build #76426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76426/testReport)** for PR 17834 at commit [`b788494`](https://github.com/apache/spark/commit/b788494cb63c91814309cdf22b55d3301292ac66).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    **[Test build #76390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76390/testReport)** for PR 17834 at commit [`72a03ed`](https://github.com/apache/spark/commit/72a03ed58331813b0ad4bc9517fcc1f23a5eda6f).




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114982357
  
    --- Diff: docs/cloud-integration.md ---
    @@ -0,0 +1,203 @@
    +---
    +layout: global
    +displayTitle: Integration with Cloud Infrastructures
    +title: Integration with Cloud Infrastructures
    +description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
    +---
    +<!---
    +  Licensed under the Apache License, Version 2.0 (the "License");
    +  you may not use this file except in compliance with the License.
    +  You may obtain a copy of the License at
    +
    +   http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing, software
    +  distributed under the License is distributed on an "AS IS" BASIS,
    +  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  See the License for the specific language governing permissions and
    +  limitations under the License. See accompanying LICENSE file.
    +-->
    +
    +* This will become a table of contents (this text will be scraped).
    +{:toc}
    +
    +## Introduction
    +
    +
    +All major cloud providers offer persistent data storage in *object stores*.
    +These are not classic "POSIX" file systems.
    +In order to store hundreds of petabytes of data without any single points of failure,
    +object stores replace the classic filesystem directory tree
    +with a simpler model of `object-name => data`. To enable remote access, operations
    +on objects are usually offered as (slow) HTTP REST operations.
    +
    +Spark can read and write data in object stores through filesystem connectors implemented
    +in Hadoop or provided by the infrastructure suppliers themselves.
    +These connectors make the object stores look *almost* like filesystems, with directories and files
    +and the classic operations on them such as list, delete and rename.
    +
    +
    +### Important: Cloud Object Stores are Not Real Filesystems
    +
    +While the stores appear to be filesystems, underneath
    +they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
    --- End diff --
    
    Good catch. Done throughout the file except in the Apache license header. Rendered the HTML and clicked through the links.
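    
    As a quick illustration of the access pattern the page describes (a minimal sketch only: the bucket name and paths are placeholders, and the relevant connector JAR plus credentials are assumed to be configured):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    object ObjectStoreSketch {
      def main(args: Array[String]): Unit = {
        // Credentials are assumed to be supplied through Hadoop configuration or the environment.
        val spark = SparkSession.builder().appName("object-store-sketch").getOrCreate()
    
        // Any of the supported connector URIs (s3a://, wasb://, swift://) is addressed the same way;
        // "s3a://example-bucket/..." is a hypothetical path.
        val df = spark.read.text("s3a://example-bucket/input/")
        df.write.parquet("s3a://example-bucket/output/")
    
        spark.stop()
      }
    }
    ```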




[GitHub] spark pull request #17834: [SPARK-7481] [build] Add spark-hadoop-cloud modul...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17834#discussion_r114975430
  
    --- Diff: hadoop-cloud/pom.xml ---
    @@ -0,0 +1,185 @@
    +<?xml version="1.0" encoding="UTF-8"?>
    +<!--
    +  ~ Licensed to the Apache Software Foundation (ASF) under one or more
    +  ~ contributor license agreements.  See the NOTICE file distributed with
    +  ~ this work for additional information regarding copyright ownership.
    +  ~ The ASF licenses this file to You under the Apache License, Version 2.0
    +  ~ (the "License"); you may not use this file except in compliance with
    +  ~ the License.  You may obtain a copy of the License at
    +  ~
    +  ~    http://www.apache.org/licenses/LICENSE-2.0
    +  ~
    +  ~ Unless required by applicable law or agreed to in writing, software
    +  ~ distributed under the License is distributed on an "AS IS" BASIS,
    +  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +  ~ See the License for the specific language governing permissions and
    +  ~ limitations under the License.
    +  -->
    +<project xmlns="http://maven.apache.org/POM/4.0.0"
    +  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    +  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    +  <modelVersion>4.0.0</modelVersion>
    +  <parent>
    +    <groupId>org.apache.spark</groupId>
    +    <artifactId>spark-parent_2.11</artifactId>
    +    <version>2.3.0-SNAPSHOT</version>
    +    <relativePath>../pom.xml</relativePath>
    +  </parent>
    +
    +  <artifactId>spark-hadoop-cloud_2.11</artifactId>
    +  <packaging>jar</packaging>
    +  <name>Spark Project Cloud Integration</name>
    --- End diff --
    
    Maybe "Spark Project Cloud Object Store Integration for Hadoop"? I feel like getting "Hadoop" in there is useful to emphasize that this is adding connectivity from the HDFS APIs to object stores in particular.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    **[Test build #76390 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76390/testReport)** for PR 17834 at commit [`72a03ed`](https://github.com/apache/spark/commit/72a03ed58331813b0ad4bc9517fcc1f23a5eda6f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    I've just pushed up an update which changes the module name; tested in Maven and SBT, and the Hadoop cloud JAR dependencies are pulled down.
    
    A JAR is created, but it's just a stub. As a result, when you build the assembly with `-Phadoop-cloud`, you get something from Spark mixed in with the Hadoop modules:
    ```
    assembly/target/scala-2.11/jars/hadoop-aws-2.7.3.jar
    assembly/target/scala-2.11/jars/hadoop-azure-2.7.3.jar
    assembly/target/scala-2.11/jars/hadoop-client-2.7.3.jar
    assembly/target/scala-2.11/jars/hadoop-cloud_2.11-2.3.0-SNAPSHOT.jar
    assembly/target/scala-2.11/jars/hadoop-common-2.7.3.jar
    assembly/target/scala-2.11/jars/hadoop-hdfs-2.7.3.jar
    assembly/target/scala-2.11/jars/hadoop-mapreduce-client-app-2.7
    ```
    
    Like I've said before, that's a bit dangerous, especially with a hadoop-cloud-projects POM module now upstream. We won't ever produce a hadoop-cloud-projects JAR, so there won't be a direct conflict, but there is potential for confusion if people see a JAR beginning `hadoop-*` with different version info.
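    
    For reference, the kind of invocation assumed here would be something along the lines of `./build/mvn -Phadoop-cloud -DskipTests clean package`; the exact set of additional profiles (Hadoop version, YARN, Hive) depends on the target deployment.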




[GitHub] spark issue #17834: [SPARK-7481] [build] Add spark-hadoop-cloud module to pu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17834
  
    **[Test build #76457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76457/testReport)** for PR 17834 at commit [`e173e3f`](https://github.com/apache/spark/commit/e173e3f2a60a8ecc9875dbda24beba793d86d019).

