Posted to general@hadoop.apache.org by Eric Baldeschwieler <er...@hortonworks.com> on 2013/03/01 06:02:16 UTC

Re: where do side-projects go in trunk now that contrib/ is gone?

I agree with where this is going.

Swift and S3 are compelling enough that they should be in the source tree IMO.  Hadoop needs to play well with common platforms such as the major clouds.

On the other hand, it would be great if we could segregate them enough that each builds as its own JAR and folks have the option of not pulling their dependencies in and not building / testing them, in a clean way.

On Feb 14, 2013, at 6:05 AM, Steve Loughran <st...@gmail.com> wrote:

> On 13 February 2013 20:07, Alejandro Abdelnur <tu...@cloudera.com> wrote:
> 
>> Steve,
>> 
>> I like the idea of testing all FS for expected behavior, in HttpFS we are
>> already doing something along these lines testing HttpFS against HDFS and
>> LocalFS. Also testing 2 WebHDFS clients.
>> 
> 
> excellent. I look forward to your test contributions!
> 
>> 
>> Regarding where these 'extensions' would go, well, we could have something
>> like share/hadoop/common/filesystem-ext/s3 and whoever wants to use s3
>> would have to symlink those JARs into common/lib. Or have a way to
>> activate, via a HADOOP_COMMON_FS_EXT env var, which extension JARs to pick
>> up. I guess the BigTop guys could help define this magic.
>> 
>> 
> I was thinking less of "where should it go at install time" and more of
> "where do we keep it in SVN"
> 
> at install time you'd need the JAR + any dependencies on the daemon paths
> -if it is to be everywhere- or uploaded with a job into distributed cache.
> Testing that the latter works with filesystem.get() would be something to
> play with.
> 
> & yes, bigtop could help there
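The "works with filesystem.get()" step above comes down to scheme-based lookup of an implementation class from configuration. A toy sketch of that resolution logic follows; the class names are hypothetical stand-ins and this is a simplified illustration, not Hadoop's actual FileSystem.get() code:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified sketch of scheme-based filesystem resolution, loosely
 * modelled on the "fs.<scheme>.impl" configuration pattern: the URI
 * scheme selects an implementation class name. Hypothetical names only.
 */
public class FsResolver {
    // conf maps "fs.<scheme>.impl" -> implementation class name
    private final Map<String, String> conf = new HashMap<>();

    public void set(String key, String value) {
        conf.put(key, value);
    }

    /** Resolve the implementation class name for a filesystem URI. */
    public String resolve(URI uri) {
        String scheme = uri.getScheme();
        String impl = conf.get("fs." + scheme + ".impl");
        if (impl == null) {
            throw new IllegalArgumentException("No filesystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        FsResolver r = new FsResolver();
        r.set("fs.hdfs.impl", "org.example.DistributedFileSystem");
        r.set("fs.swift.impl", "org.example.SwiftFileSystem");
        System.out.println(r.resolve(URI.create("swift://container/path")));
    }
}
```

The real mechanism adds caching and more configuration, but the point stands: the JAR carrying the named class has to be on the classpath -daemon path or distributed cache- at the moment resolution happens.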


Re: where do side-projects go in trunk now that contrib/ is gone?

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
sounds good, thx


On Sat, Mar 9, 2013 at 3:36 AM, Steve Loughran <st...@gmail.com> wrote:

> On 8 March 2013 18:47, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>
> > I was chatting offline with Roman about this, his point is
> >
> > 1* segregation of the FS impls into different modules makes sense
> > 2* it should be OK if they have mock services for unit tests
> >
>
> not so much mock tests as live tests against individual features (rename,
> delete, mkdirs), but not full tests of MR jobs, Pig jobs, etc -which verify
> that real code works with it
>
>
> > 3* bigtop could do real integration testing
> >
>
> exactly -it's at the end of the dependency graph, and the best place to do
> that
>
>
> > 4* by doing this, the diff FileSystem impls would be there out of the box
> >
> > If we go down this path, I'm OK with it.
> >
>
>
> >
> > Thoughts?
> >
> >
> This is exactly what I've been thinking
>



-- 
Alejandro

Re: where do side-projects go in trunk now that contrib/ is gone?

Posted by Steve Loughran <st...@gmail.com>.
On 8 March 2013 18:47, Alejandro Abdelnur <tu...@cloudera.com> wrote:

> I was chatting offline with Roman about this, his point is
>
> 1* segregation of the FS impls into different modules makes sense
> 2* it should be OK if they have mock services for unit tests
>

not so much mock tests as live tests against individual features (rename,
delete, mkdirs), but not full tests of MR jobs, Pig jobs, etc -which verify
that real code works with it


> 3* bigtop could do real integration testing
>

exactly -it's at the end of the dependency graph, and the best place to do
that


> 4* by doing this, the diff FileSystem impls would be there out of the box
>
> If we go down this path, I'm OK with it.
>


>
> Thoughts?
>
>
This is exactly what I've been thinking

Re: where do side-projects go in trunk now that contrib/ is gone?

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
I was chatting offline with Roman about this, his point is

1* segregation of the FS impls into different modules makes sense
2* it should be OK if they have mock services for unit tests
3* bigtop could do real integration testing
4* by doing this, the diff FileSystem impls would be there out of the box

If we go down this path, I'm OK with it.

Thoughts?




On Fri, Mar 8, 2013 at 9:07 AM, Alejandro Abdelnur <tu...@cloudera.com> wrote:

>
> > We are already there with the S3 and Azure blobstores, as well as the FTP
> > filesystem
>
> I think this is not correct and we should plan to move them out.
>
> This is independent of the effort to straighten up the FS spec, which I
> think is great.
>
> Thx
>
> On Fri, Mar 8, 2013 at 8:57 AM, Steve Loughran <st...@gmail.com> wrote:
>
>> On 8 March 2013 16:15, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>>
>> > jumping a bit late into the discussion.
>> >
>> yes. I started it in common-dev first, in the "where does contrib stuff
>> go now", moved to general, where the conclusion was "except for special
>> cases like FS clients, it isn't".
>>
>> Now I'm trying to lay down the location for FS stuff, both for openstack,
>> and to handle some proposed dependency changes for s3n://
>>
>>
>> > I'd argue that unless those filesystems are part of hadoop, their
>> > clients should not be distributed/built by hadoop.
>> >
>> > an analogy to this is not wanting Yarn to be the home for AM
>> > implementations.
>> >
>> > a key concern is testability and maintainability.
>> >
>>
>> We are already there with the S3 and Azure blobstores, as well as the FTP
>> filesystem
>>
>> The testability is straightforward for blobstores precisely because all
>> you
>> need is some credentials and cluster time; there's no requirement to have
>> some specific filesystem to hand. Any of those -very much in the vendors
>> hand to do their own testing, especially if the "it's a replacement for
>> HDFS" assertion is made.
>>
>> If you look at HADOOP-9361 you can see that I've been defining more
>> rigorously than before what our FS expectations are, with HADOOP-9371
>> spelling it out "what happens when you try to readFully() past the end of
>> a
>> file, or call getBlockLocations("/")? HDFS has actions here, and
>> downstream
>> code depends on some things (e.g. getBlockLocations() behaviour on
>> directories)
>>
>> https://issues.apache.org/jira/secure/attachment/12572328/HadoopFilesystemContract.pdf
>>
>> So far my initially blobstore-specific tests for the functional parts of
>> the specification (not the consistency, concurrency, atomicity parts) are
>> in github
>>
>> https://github.com/hortonworks/Hadoop-and-Swift-integration/tree/master/swift-file-system/src/test/java/org/apache/hadoop/fs/swift
>>
>>
>> I've also added more tests to the existing FS contract test, and in doing
>> so showed that s3 and s3n have some data-loss risks which need to be fixed
>> -that's an argument in favour of having the (testable, low-maintenance
>> cost) filesystems somewhere where any of us is free to fix them.
>>
>> While we refine that spec further, I want to take those per-operation tests
>> from the SwiftFS support, make them retargetable at other filesystems, and
>> slowly apply them to all the distributed filesystems. Your colleague
>> Andrew
>> Wang is helping there by abstracting FileSystem and FileContext away, so
>> we
>> can test both.
>>
>> > still, I see bigtop as the integration point and the means of making
>> > those jars available to a setup.
>> >
>> >
>> I plan to put the integration tests there -the tests that try to run Pig
>> with arbitrary source and dest filesystems, same for Hive- plus some
>> scale tests: can we upload an 8GB file? What do you get back? Can I
>> create > 65536 entries in a single directory, and what happens to ls /
>> performance?
>>
>> To summarise then
>>
>>    1. blobstores, ftpfilesystem & c could gradually move to a
>>    hadoop-common/hadoop-filesystem-clients
>>    2. A stricter specification of compliance, for the benefit of everyone
>>    -us, other FS implementors and users of FS APIs
>>    3. Lots of new functional tests for compliance -abstract in
>>    hadoop-common; FS-specific in hadoop-filesystem-clients.
>>    4. Integration & scale tests in bigtop
>>    5. Anyone writing a "hadoop compatible FS" can grab the functional and
>>    integration tests and see what breaks -fixing their code.
>>    6. The combination of (Java API files, specification doc, functional
>>    tests, HDFS implementation) define the expected behavior of a
>> filesystem
>>
>> -Steve
>>
>>
>
>
>
> --
> Alejandro
>



-- 
Alejandro

Re: where do side-projects go in trunk now that contrib/ is gone?

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
> We are already there with the S3 and Azure blobstores, as well as the FTP
> filesystem

I think this is not correct and we should plan to move them out.

This is independent of the effort to straighten up the FS spec, which I
think is great.

Thx

On Fri, Mar 8, 2013 at 8:57 AM, Steve Loughran <st...@gmail.com> wrote:

> On 8 March 2013 16:15, Alejandro Abdelnur <tu...@cloudera.com> wrote:
>
> > jumping a bit late into the discussion.
> >
> yes. I started it in common-dev first, in the "where does contrib stuff
> go now", moved to general, where the conclusion was "except for special
> cases like FS clients, it isn't".
>
> Now I'm trying to lay down the location for FS stuff, both for openstack,
> and to handle some proposed dependency changes for s3n://
>
>
> > I'd argue that unless those filesystems are part of hadoop, their clients
> > should not be distributed/built by hadoop.
> >
> > an analogy to this is not wanting Yarn to be the home for AM
> > implementations.
> >
> > a key concern is testability and maintainability.
> >
>
> We are already there with the S3 and Azure blobstores, as well as the FTP
> filesystem
>
> The testability is straightforward for blobstores precisely because all you
> need is some credentials and cluster time; there's no requirement to have
> some specific filesystem to hand. Any of those -very much in the vendors
> hand to do their own testing, especially if the "it's a replacement for
> HDFS" assertion is made.
>
> If you look at HADOOP-9361 you can see that I've been defining more
> rigorously than before what our FS expectations are, with HADOOP-9371
> spelling it out "what happens when you try to readFully() past the end of a
> file, or call getBlockLocations("/")? HDFS has actions here, and downstream
> code depends on some things (e.g. getBlockLocations() behaviour on
> directories)
>
> https://issues.apache.org/jira/secure/attachment/12572328/HadoopFilesystemContract.pdf
>
> So far my initially blobstore-specific tests for the functional parts of
> the specification (not the consistency, concurrency, atomicity parts) are
> in github
>
> https://github.com/hortonworks/Hadoop-and-Swift-integration/tree/master/swift-file-system/src/test/java/org/apache/hadoop/fs/swift
>
>
> I've also added more tests to the existing FS contract test, and in doing
> so showed that s3 and s3n have some data-loss risks which need to be fixed
> -that's an argument in favour of having the (testable, low-maintenance
> cost) filesystems somewhere where any of us is free to fix them.
>
> While we refine that spec further, I want to take those per-operation tests
> from the SwiftFS support, make them retargetable at other filesystems, and
> slowly apply them to all the distributed filesystems. Your colleague Andrew
> Wang is helping there by abstracting FileSystem and FileContext away, so we
> can test both.
>
> > still, I see bigtop as the integration point and the means of making
> > those jars available to a setup.
> >
> >
> I plan to put the integration tests there -the tests that try to run Pig
> with arbitrary source and dest filesystems, same for Hive- plus some scale
> tests: can we upload an 8GB file? What do you get back? Can I create >
> 65536 entries in a single directory, and what happens to ls / performance?
>
> To summarise then
>
>    1. blobstores, ftpfilesystem & c could gradually move to a
>    hadoop-common/hadoop-filesystem-clients
>    2. A stricter specification of compliance, for the benefit of everyone
>    -us, other FS implementors and users of FS APIs
>    3. Lots of new functional tests for compliance -abstract in
>    hadoop-common; FS-specific in hadoop-filesystem-clients.
>    4. Integration & scale tests in bigtop
>    5. Anyone writing a "hadoop compatible FS" can grab the functional and
>    integration tests and see what breaks -fixing their code.
>    6. The combination of (Java API files, specification doc, functional
>    tests, HDFS implementation) define the expected behavior of a filesystem
>
> -Steve
>
>



-- 
Alejandro

Re: where do side-projects go in trunk now that contrib/ is gone?

Posted by Steve Loughran <st...@gmail.com>.
On 8 March 2013 16:15, Alejandro Abdelnur <tu...@cloudera.com> wrote:

> jumping a bit late into the discussion.
>
yes. I started it in common-dev first, in the "where does contrib stuff go
now", moved to general, where the conclusion was "except for special cases
like FS clients, it isn't".

Now I'm trying to lay down the location for FS stuff, both for openstack,
and to handle some proposed dependency changes for s3n://


> I'd argue that unless those filesystems are part of hadoop, their clients
> should not be distributed/built by hadoop.
>
> an analogy to this is not wanting Yarn to be the home for AM
> implementations.
>
> a key concern is testability and maintainability.
>

We are already there with the S3 and Azure blobstores, as well as the FTP
filesystem

The testability is straightforward for blobstores precisely because all you
need is some credentials and cluster time; there's no requirement to have
some specific filesystem to hand. Any of those are very much in the
vendors' hands to test themselves, especially if the "it's a replacement
for HDFS" assertion is made.

If you look at HADOOP-9361 you can see that I've been defining more
rigorously than before what our FS expectations are, with HADOOP-9371
spelling it out: what happens when you try to readFully() past the end of a
file, or call getBlockLocations("/")? HDFS has answers here, and downstream
code depends on some of them (e.g. getBlockLocations() behaviour on
directories).
https://issues.apache.org/jira/secure/attachment/12572328/HadoopFilesystemContract.pdf
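To make the first of those questions concrete, here is a probe of the same edge case against plain java.io, where RandomAccessFile.readFully() raises EOFException when asked for more bytes than the file holds. Whether a given Hadoop FileSystem behaves the same way is exactly what the spec and contract tests have to pin down -this is illustrative JDK code, not Hadoop's:

```java
import java.io.EOFException;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Probes a spec question: what does readFully() do past the end of a
 * file? For java.io's RandomAccessFile, the answer is EOFException;
 * a filesystem contract test would assert the chosen answer for each
 * FileSystem implementation.
 */
public class ReadPastEofProbe {
    /** Returns true if readFully past EOF raised EOFException. */
    public static boolean readPastEofThrows() throws IOException {
        File f = File.createTempFile("probe", ".bin");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[4]); // the file holds only 4 bytes
        }
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.readFully(new byte[16]); // ask for more than exists
            return false;
        } catch (EOFException expected) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("readFully past EOF throws: " + readPastEofThrows());
    }
}
```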

So far my initially blobstore-specific tests for the functional parts of
the specification (not the consistency, concurrency, atomicity parts) are
in github
https://github.com/hortonworks/Hadoop-and-Swift-integration/tree/master/swift-file-system/src/test/java/org/apache/hadoop/fs/swift


I've also added more tests to the existing FS contract test, and in doing
so showed that s3 and s3n have some data-loss risks which need to be fixed
-that's an argument in favour of having the (testable, low-maintenance
cost) filesystems somewhere where any of us is free to fix them.

While we refine that spec further, I want to take those per-operation tests
from the SwiftFS support, make them retargetable at other filesystems, and
slowly apply them to all the distributed filesystems. Your colleague Andrew
Wang is helping there by abstracting FileSystem and FileContext away, so we
can test both.

> still, I see bigtop as the integration point and the means of making
> those jars available to a setup.
>
>
I plan to put the integration tests there -the tests that try to run Pig
with arbitrary source and dest filesystems, same for Hive- plus some scale
tests: can we upload an 8GB file? What do you get back? Can I create >
65536 entries in a single directory, and what happens to ls / performance?

To summarise then

   1. blobstores, ftpfilesystem & c could gradually move to a
   hadoop-common/hadoop-filesystem-clients
   2. A stricter specification of compliance, for the benefit of everyone
   -us, other FS implementors and users of FS APIs
   3. Lots of new functional tests for compliance -abstract in
   hadoop-common; FS-specific in hadoop-filesystem-clients.
   4. Integration & scale tests in bigtop
   5. Anyone writing a "hadoop compatible FS" can grab the functional and
   integration tests and see what breaks -fixing their code.
   6. The combination of (Java API files, specification doc, functional
   tests, HDFS implementation) define the expected behavior of a filesystem
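Points 3 and 5 above amount to an abstract test base class plus per-filesystem subclasses. A minimal sketch of that shape follows; the FsClient interface and in-memory implementation are hypothetical stand-ins for illustration, not the real FileSystem API:

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of the "abstract contract test" pattern: a filesystem-agnostic
 * base class asserts the expected semantics, and each FS module binds
 * its own client in a subclass. All names here are hypothetical.
 */
public class ContractSketch {

    /** Hypothetical minimal client surface; NOT the real Hadoop FileSystem API. */
    interface FsClient {
        boolean mkdirs(String path);
        boolean rename(String src, String dst);
        boolean delete(String path);
        boolean exists(String path);
    }

    /** FS-agnostic contract checks, reusable against any implementation. */
    static abstract class AbstractContractCheck {
        protected abstract FsClient createClient();

        static void require(boolean cond, String msg) {
            if (!cond) throw new AssertionError(msg);
        }

        void checkMkdirRenameDelete() {
            FsClient fs = createClient();
            require(fs.mkdirs("/a"), "mkdirs should succeed");
            require(fs.rename("/a", "/b"), "rename should succeed");
            require(!fs.exists("/a"), "source should be gone after rename");
            require(fs.exists("/b"), "dest should exist after rename");
            require(fs.delete("/b"), "delete should succeed");
            require(!fs.exists("/b"), "path should be gone after delete");
        }
    }

    /** Toy in-memory implementation, standing in for a real filesystem module. */
    static class InMemoryFs implements FsClient {
        private final Set<String> paths = new HashSet<>();
        public boolean mkdirs(String p) { return paths.add(p); }
        public boolean rename(String s, String d) { return paths.remove(s) && paths.add(d); }
        public boolean delete(String p) { return paths.remove(p); }
        public boolean exists(String p) { return paths.contains(p); }
    }

    public static void main(String[] args) {
        // each FS module would supply its own subclass binding its client
        new AbstractContractCheck() {
            protected FsClient createClient() { return new InMemoryFs(); }
        }.checkMkdirRenameDelete();
        System.out.println("contract checks passed");
    }
}
```

The abstract checks would live in hadoop-common; only the createClient() binding (and any credentials) would live in each hadoop-filesystem-clients module.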

-Steve

Re: where do side-projects go in trunk now that contrib/ is gone?

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
jumping a bit late into the discussion. 

I'd argue that unless those filesystems are part of hadoop, their clients should not be distributed/built by hadoop. 

an analogy to this is not wanting Yarn to be the home for AM implementations. 

a key concern is testability and maintainability. 

still, I see bigtop as the integration point and the means of making those jars available to a setup. 

thanks

Alejandro
(phone typing)

On Mar 8, 2013, at 6:43 AM, Steve Loughran <st...@gmail.com> wrote:

> On 1 March 2013 05:02, Eric Baldeschwieler <er...@hortonworks.com> wrote:
> 
>> I agree with where this is going.
>> 
>> Swift and S3 are compelling enough that they should be in the source tree
>> IMO.  Hadoop needs to play well with common platforms such as the major
>> clouds.
>> 
>> On the other hand, it would be great if we could segregate them enough
>> that each builds as its own JAR and folks have the option of not pulling
>> their dependencies in and not building / testing them, in a clean way.
> I've added a JIRA on setting up a bit of the src tree and subproject(s)
> for these: https://issues.apache.org/jira/browse/HADOOP-9385
> 
> Test plans go into https://issues.apache.org/jira/browse/HADOOP-9361,
> which can evolve at a different rate.
> 
> -Steve

Re: where do side-projects go in trunk now that contrib/ is gone?

Posted by Steve Loughran <st...@gmail.com>.
On 1 March 2013 05:02, Eric Baldeschwieler <er...@hortonworks.com> wrote:

> I agree with where this is going.
>
> Swift and S3 are compelling enough that they should be in the source tree
> IMO.  Hadoop needs to play well with common platforms such as the major
> clouds.
>
> On the other hand, it would be great if we could segregate them enough
> that each builds as its own JAR and folks have the option of not pulling
> their dependencies in and not building / testing them, in a clean way.
>
>
I've added a JIRA on setting up a bit of the src tree and subproject(s) for
these: https://issues.apache.org/jira/browse/HADOOP-9385

Test plans go into https://issues.apache.org/jira/browse/HADOOP-9361, which
can evolve at a different rate.

-Steve