Posted to common-dev@hadoop.apache.org by James Malone <ja...@google.com.INVALID> on 2015/12/07 23:35:19 UTC

Google Cloud Storage connector into Hadoop

Hello,

We're from a team within Google Cloud Platform focused on OSS and data
technologies, especially Hadoop (and Spark). Before we cut a JIRA for
something we’d like to do, we wanted to reach out to this list to ask two
quick questions, describe our proposed action, and check for any major
objections.

Proposed action:
We have a Hadoop connector[1] (more info[2]) for Google Cloud Storage (GCS)
which we have been building and maintaining for some time. After we clean
up our code and tests to conform (to these[3] and other requirements) we
would like to contribute it to Hadoop. We have many customers using the
connector in high-throughput production Hadoop clusters; we’d like to make
it easier and faster to use Hadoop and GCS.

Timeline:
Presently, we are working on the beta of Google Cloud Dataproc[4] which
limits our time a bit, so we’re targeting late Q1 2016 for creating a JIRA
issue and adapting our connector code as needed.

Our (quick) questions:
* Do we need to take any (non-coding) action for this beyond submitting a
JIRA when we are ready?
* Are there any up-front concerns or questions which we can (or will need
to) address?

Thank you!

James Malone
On behalf of the Google Big Data OSS Engineering Team

Links:
[1] - https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
[2] - https://cloud.google.com/hadoop/google-cloud-storage-connector
[3] - https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs
[4] - https://cloud.google.com/dataproc

Re: Google Cloud Storage connector into Hadoop

Posted by jay vyas <ja...@gmail.com>.
See also the HCFS wiki page https://wiki.apache.org/hadoop/HCFS/Progress
which attempts to explain this stuff for the community. Maybe it needs some
updates as well; I haven't looked in a while, as I've moved on to working on
other products nowadays.




On Tue, Dec 8, 2015 at 12:50 PM, Steve Loughran <st...@hortonworks.com>
wrote:

> [full quote of Steve Loughran's reply trimmed; the message appears in full below]


-- 
jay vyas

Re: Google Cloud Storage connector into Hadoop

Posted by Steve Loughran <st...@hortonworks.com>.
1. Do what Chris says: go for the abstract contract tests. They'll find the trouble spots in your code, like the way seek(-1) appears to have entertaining results and what happens on operations on closed files, and help identify where the semantics of your FS vary from HDFS.
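
A minimal sketch of that wiring, assuming a hypothetical hadoop-gcs
module laid out like hadoop-aws: each shared contract test gets bound
to a store-specific contract definition.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.contract.AbstractContractSeekTest;
    import org.apache.hadoop.fs.contract.AbstractFSContract;

    // Hypothetical test class for a contributed hadoop-gcs module: it
    // reuses the shared seek contract tests (which probe seek(-1),
    // reads after close, etc.) against GCS. GCSContract is an assumed
    // class that would extend AbstractBondedFSContract and load a
    // contract/gcs.xml declaring which behaviors the store supports.
    public class TestGCSContractSeek extends AbstractContractSeekTest {
      @Override
      protected AbstractFSContract createContract(Configuration conf) {
        return new GCSContract(conf);
      }
    }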

2. You will need to stay with the versions of artifacts in the Hadoop codebase. Trouble spots there are protobuf (frozen @ 2.5) and Guava (Hadoop ships 11.0.2, but code must also run against 18.x+ if someone upgrades). If this is problematic, you may want to discuss the versioning issues with your colleagues; see https://issues.apache.org/jira/browse/HADOOP-10101 for the details.
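
To make that concrete, here is a sketch of the defensive coding the
dual-version constraint forces. Closeables.closeQuietly(Closeable)
exists in Guava 11.0.2 but was removed by 16.0, so connector code that
has to run across the whole range sticks to plain JDK constructs:

    import java.io.Closeable;
    import java.io.IOException;

    // Guava 11.0.2 has Closeables.closeQuietly(Closeable), but it was
    // dropped in Guava 16; code that must run against both 11.0.2 and
    // 18.x avoids it and uses the JDK directly.
    final class IOUtilsCompat {
      private IOUtilsCompat() {}

      // Close a stream, swallowing any IOException, without touching
      // version-sensitive Guava APIs.
      static void closeQuietly(Closeable c) {
        if (c == null) {
          return;
        }
        try {
          c.close();
        } catch (IOException ignored) {
          // best-effort cleanup path; deliberately swallowed
        }
      }
    }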

3. The object stores get undertested: Jenkins doesn't touch them for patch review or nightly runs, because you can't give Jenkins the right credentials. Setting up your own Jenkins server to build the Hadoop versions and flag problems would be a great contribution here. Also: help with the release testing, and if someone has a patch for the hadoop-gcs module, reviewing and testing that too would be great; it stops these patches being neglected.

4. We could do with some more scale tests of the object stores, to test creating many thousands of small files, etc. Contributions welcome.
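
For example, a bare-bones version of such a scale test might look like
the sketch below; the target URI and file count are placeholders, and a
real test would also time the run and verify the listing afterwards.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Create many tiny files against whatever store the URI points at
    // (hdfs://, s3a://, gs://, ...); object stores often behave very
    // differently from HDFS under this kind of metadata-heavy load.
    public class ManySmallFilesScaleTest {
      public static void main(String[] args) throws Exception {
        Path base = new Path(args[0]);  // e.g. gs://bucket/scale-test
        FileSystem fs = base.getFileSystem(new Configuration());
        byte[] payload = new byte[16];  // deliberately tiny payload
        int count = 10000;
        for (int i = 0; i < count; i++) {
          Path p = new Path(base, "file-" + i);
          try (FSDataOutputStream out = fs.create(p, true)) {
            out.write(payload);
          }
        }
        System.out.println("created " + count + " files under " + base);
      }
    }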

5. We could do with a lot more downstream testing of things like Hive & Spark IO on object stores, especially via ORC and Parquet. Helping to write those tests would stop regressions in the stack and help tune Hadoop for your FS.
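
A sketch of what one such downstream test could look like, using the
Spark 1.x Java API to round-trip a small DataFrame through Parquet on
the store under test (the target path is a placeholder):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    // Write a small DataFrame as Parquet to an object-store path
    // (e.g. gs://bucket/test/parquet), read it back, and check that
    // nothing was lost in the round trip.
    public class ParquetRoundTripTest {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("parquet-round-trip").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sql = new SQLContext(sc);
        String target = args[0];  // object-store URI under test
        DataFrame df = sql.range(0, 1000);  // one "id" column
        df.write().mode("overwrite").parquet(target);
        long n = sql.read().parquet(target).count();
        if (n != 1000) {
          throw new AssertionError("expected 1000 rows, got " + n);
        }
        sc.stop();
      }
    }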

6. Finally: don't be afraid to get involved with the rest of the codebase. It can only get better.


> On 8 Dec 2015, at 00:20, James Malone <ja...@google.com.INVALID> wrote:
> [full quote of the earlier thread trimmed; those messages appear in full below]


Re: Google Cloud Storage connector into Hadoop

Posted by James Malone <ja...@google.com.INVALID>.
Haohui & Chris,

Sounds great, thank you very much! We'll cut a JIRA once we get everything
lined up.

Best,

James

On Mon, Dec 7, 2015 at 3:54 PM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> [full quote of Chris's reply and the earlier thread trimmed; those messages appear in full below]

Re: Google Cloud Storage connector into Hadoop

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hi James,

This sounds great!  Thank you for considering contributing the code.

Just seconding what Haohui said, there is existing precedent for
alternative implementations of the Hadoop FileSystem in our codebase.  We
currently have similar plugins for S3 [1], Azure [2] and OpenStack Swift
[3].  Additionally, we have a suite of FileSystem contract tests [4].
These tests are designed to help developers of alternative file systems
assess how closely they match the semantics expected by Hadoop ecosystem
components.

Many Hadoop users are accustomed to using HDFS instead of these
alternative file systems, so none of the alternatives are on the default
Hadoop classpath immediately after deployment.  Instead, the code for each
one is in a separate module under the "hadoop-tools" directory in the
source tree.  Users who need to use the alternative file systems take
extra steps post-deployment to add them to the classpath where necessary.
This achieves the dependency isolation needed.  For example, users who
never use the Azure plugin won't accidentally pick up a transitive
dependency on the Azure SDK jar.
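
As an illustration of those post-deployment steps: once the connector
jar is on the classpath, the store is reached through the ordinary
FileSystem API. The property name and implementation class below follow
the connector's published documentation, but treat the exact values as
assumptions for a future contributed module.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Wire up the gs:// scheme via configuration, then list a bucket
    // like any other Hadoop filesystem.
    public class GcsListExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        FileSystem fs =
            FileSystem.get(URI.create("gs://my-bucket/"), conf);
        for (FileStatus st : fs.listStatus(new Path("gs://my-bucket/"))) {
          System.out.println(st.getPath() + " " + st.getLen());
        }
      }
    }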

I recommend taking a quick glance through the existing modules for S3,
Azure and OpenStack.  We'll likely ask that a new FileSystem
implementation follow the same patterns if feasible for consistency.  This
would include things like using the contract tests, having a provision to
execute tests both offline/mocked and live/integrated with the real
service and providing a documentation page that explains configuration for
end users.
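
One common shape for that offline/live split is a JUnit assumption that
skips the integration tests whenever no live test bucket is configured;
the property name here is invented for the example.

    import static org.junit.Assume.assumeTrue;

    import org.apache.hadoop.conf.Configuration;
    import org.junit.Before;
    import org.junit.Test;

    // Live tests run only when a test bucket is configured, so a plain
    // build stays green on machines with no GCS credentials.
    public class GcsLiveIT {
      private String bucket;

      @Before
      public void setUp() {
        Configuration conf = new Configuration();
        bucket = conf.get("fs.gs.test.bucket");  // assumed property
        assumeTrue("no test bucket configured; skipping live GCS tests",
            bucket != null && !bucket.isEmpty());
      }

      @Test
      public void testRoundTrip() throws Exception {
        // create/read/delete against gs://<bucket>/ would go here
      }
    }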

For now, please feel free to file a HADOOP JIRA with your proposal.  We
can work out the details of all of this in discussion on that JIRA.

Something else to follow up on will be licensing concerns.  I see the
project already uses the Apache license, but it appears to be an existing
body of code initially developed at Google.  That might require a Software
Grant Agreement [5].  Again, this is something that can be hashed out in
discussion on the JIRA after you create it.

[1] 
http://hadoop.apache.org/docs/r2.7.1/hadoop-aws/tools/hadoop-aws/index.html
[2] http://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html
[3] http://hadoop.apache.org/docs/r2.7.1/hadoop-openstack/index.html
[4] 
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/filesystem/testing.html
[5] http://www.apache.org/licenses/

--Chris Nauroth




On 12/7/15, 3:10 PM, "Haohui Mai" <ri...@gmail.com> wrote:

>[full quote of Haohui's message and the original post trimmed; see below and the top of this page]


Re: Google Cloud Storage connector into Hadoop

Posted by Haohui Mai <ri...@gmail.com>.
Hi,

Thanks for reaching out. It would be great to see this in the Hadoop ecosystem.

In Hadoop we have AWS S3 support. IMO they address similar use cases, so
I think it should be relatively straightforward to adopt the code.

The only catch in my head right now is properly isolating dependencies.
Not only does the code need to be put into a separate module, but many
Hadoop applications also depend on different versions of Guava. I
think it is a problem that needs some attention at the very
beginning.

Please feel free to reach out if you have any other questions.

Regards,
Haohui


On Mon, Dec 7, 2015 at 2:35 PM, James Malone
<ja...@google.com.invalid> wrote:
> [full quote of the original post trimmed; see the top of this page]