Posted to common-dev@hadoop.apache.org by Alan Burlison <Al...@oracle.com> on 2015/09/30 10:14:46 UTC

DomainSocket issues on Solaris

Now that the Hadoop native code builds on Solaris I've been chipping 
away at all the test failures. About 50% of the failures involve 
DomainSocket, either directly or indirectly. That seems to be mainly 
because the tests use DomainSocket to do single-node testing, whereas in 
production it seems that DomainSocket is less commonly used 
(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html).

The particular problem on Solaris is that socket read/write timeouts 
(the SO_SNDTIMEO and SO_RCVTIMEO socket options) are not supported for 
UNIX domain (PF_UNIX) sockets. Those options are however supported for 
PF_INET sockets. That's because the socket implementation on Solaris is 
split roughly into two parts, for inet sockets and for STREAMS sockets, 
and the STREAMS implementation lacks support for SO_SNDTIMEO and 
SO_RCVTIMEO. As an aside, performance of sockets that use loopback or 
the host's own IP is slightly better than that of UNIX domain sockets on 
Solaris.

I'm investigating getting timeouts supported for PF_UNIX sockets added 
to Solaris, but in the meantime I'm also looking at how this might be 
worked around in Hadoop. One way would be to implement timeouts by 
wrapping all the read/write/send/recv etc calls in DomainSocket.c with 
either poll() or select().

The basic idea is to add two new fields to DomainSocket.c to hold the 
read/write timeouts. On platforms that support SO_SNDTIMEO and 
SO_RCVTIMEO these would be unused as setsockopt() would be used to set 
the socket timeouts. On platforms such as Solaris the JNI code would use 
the values to implement the timeouts appropriately.

To prevent the code in DomainSocket.c becoming a #ifdef hairball, the 
current socket IO function calls such as accept(), send(), read() etc 
would be replaced with macros such as HD_ACCEPT. On platforms that 
provide timeouts these would just expand to the normal socket functions; 
on platforms that don't support timeouts they would expand to wrappers 
that implement the timeouts.

The only caveat is that all code that does anything to a PF_UNIX 
socket would *always* have to do so via DomainSocket. As far as I can 
tell that's not an issue, but it would have to be borne in mind if any 
changes were made in this area.

Before I set about doing this, does the approach seem reasonable?

Thanks,

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Chris Nauroth <cn...@hortonworks.com>.
Alan, thank you for picking up HADOOP-11127.  I think it has needed a
strong use case to kick it back into action, and maybe Solaris support is
that use case.  I'll join the discussion on the JIRA.

--Chris Nauroth




On 10/8/15, 9:40 AM, "Alan Burlison" <Al...@oracle.com> wrote:

>On 07/10/2015 22:05, Alan Burlison wrote:
>
>> I'll draft up a proposal and attach it to HADOOP-11127.
>
>Attached to HADOOP-11127 as proposal.txt
>
>-- 
>Alan Burlison
>--
>


Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 07/10/2015 22:05, Alan Burlison wrote:

> I'll draft up a proposal and attach it to HADOOP-11127.

Attached to HADOOP-11127 as proposal.txt

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 07/10/15 18:53, Colin P. McCabe wrote:

> I think you could come up with a select/poll solution while using the
> old function signatures.  A 4-byte int is more than enough information
> to pass in, given that you can use it as an index into a table in the
> C code.

I have thought about that, but a simple table would not work very well. 
It would have to be potentially quite large and would be sparsely 
populated. It would really have to be some sort of map and would most 
likely have to be implemented in C. However it is done, it becomes a 
Solaris-only maintenance burden. Yes it's possible, but it seemed 
distinctly undesirable.

>  There are also a lot of other solutions to this problem, like
> I pointed out earlier.  For example, you dismissed the timer wheel
> suggestion because of a detail of a unit test, but we could easily
> change the test.

Unfortunately there are somewhere around 100 test failures that I think 
are related to the socket timeout issue, which is why I focussed on it.

> Anyway, changing the function signatures in the way you described is
> certainly reasonable and I wouldn't object to it.  It is probably the
> most natural solution.

That's the conclusion I came to, but I fully understand there has to be 
a solution for the Java/JNI versioning issue as well.

>> Does that sound acceptable? If so I can draft up a proposal for native
>> library version and platform naming, library search locations etc.
>
> Yes, I think it would be good to make some progress on HADOOP-11127.
> We have been putting off the issue for too long.

Even if I put together a solution for DomainSocket that doesn't need 
changes to the JNI interface I'm almost certain that subsequent work 
will hit the same issue. I'd rather spend the time up front and come up 
with a once-and-for-all solution; I think overall that will work out to 
be less effort and certainly less risky.

I'll draft up a proposal and attach it to HADOOP-11127.

Thanks,

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by "Colin P. McCabe" <cm...@apache.org>.
On Wed, Oct 7, 2015 at 9:35 AM, Alan Burlison <Al...@oracle.com> wrote:
> On 06/10/2015 10:52, Steve Loughran wrote:
>
>> HADOOP-11127, "Improve versioning and compatibility support in native
>> library for downstream hadoop-common users." says "we need to do
>> better here", which is probably some way of packaging native libs.
>
>
> From that JIRA:
>
>> Colin Patrick McCabe added a comment - 18/Apr/15 00:48
>>
>> I was thinking we:
>> 1. Add the Hadoop release version to libhadoop.so. It's very, very
>> simple and solves a lot of problems here.
>> 2. Remove libhadoop.so and libhdfs.so from the release tarball, since
>> they are CPU and OS-specific and the tarballs are not
>> 3. Schedule some follow-on work to include the native libraries
>> inside jars, as Chris suggested. This will take longer but ultimately
>> be the best solution.
>
>
> And:
>
>> I just spotted one: HADOOP-10027.  A field was removed from the Java
>> layer, which still could get referenced by an older version of the native
>> layer.  A backwards-compatible version of that patch would preserve the
>> old fields in the Java layer.
>
>
> I've been thinking about this and I really don't think the strategy of
> trying to shim old methods and fields back in to Hadoop is the correct one.
> The current Java-JNI interactions have been developed in an ad-hoc manner
> with no formal API definition and are explicitly Not-An-Interface and as a
> result no consideration has been given to cross-version stability. A
> compatibility shim approach is neither sustainable nor maintainable even on
> a single platform, and will severely compromise efforts to get Hadoop native
> components working on other platforms.

I agree.

>
> The approach suggested in HADOOP-11127 seems a much better way forward, in
> particular #2 (versioned libhadoop). As pointed out in the JIRA, #1 (freeze
> libhadoop forever) is an obvious non-starter, and #3 (distribute libhadoop
> inside the JAR) is also a non-starter as it will not work cross-platform.
>
> I'm happy to work on HADOOP-10027 and make that a prerequisite for fixing
> the Solaris DomainSocket issues discussed in this thread. I believe it's not
> practical to provide a fix for DomainSocket on Solaris with a 'No JNI
> signature changes' restriction.

I think you could come up with a select/poll solution while using the
old function signatures.  A 4-byte int is more than enough information
to pass in, given that you can use it as an index into a table in the
C code.  There are also a lot of other solutions to this problem, like
I pointed out earlier.  For example, you dismissed the timer wheel
suggestion because of a detail of a unit test, but we could easily
change the test.

Anyway, changing the function signatures in the way you described is
certainly reasonable and I wouldn't object to it.  It is probably the
most natural solution.

>
> Does that sound acceptable? If so I can draft up a proposal for native
> library version and platform naming, library search locations etc.

Yes, I think it would be good to make some progress on HADOOP-11127.
We have been putting off the issue for too long.

best,
Colin

>
>
> Thanks,
>
> --
> Alan Burlison
> --

Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 06/10/2015 10:52, Steve Loughran wrote:

> HADOOP-11127, "Improve versioning and compatibility support in native
> library for downstream hadoop-common users." says "we need to do
> better here", which is probably some way of packaging native libs.

From that JIRA:

> Colin Patrick McCabe added a comment - 18/Apr/15 00:48
>
> I was thinking we:
> 1. Add the Hadoop release version to libhadoop.so. It's very, very
> simple and solves a lot of problems here.
> 2. Remove libhadoop.so and libhdfs.so from the release tarball, since
> they are CPU and OS-specific and the tarballs are not
> 3. Schedule some follow-on work to include the native libraries
> inside jars, as Chris suggested. This will take longer but ultimately
> be the best solution.

And:

> I just spotted one: HADOOP-10027.  A field was removed from the Java
> layer, which still could get referenced by an older version of the native
> layer.  A backwards-compatible version of that patch would preserve the
> old fields in the Java layer.

I've been thinking about this and I really don't think the strategy of 
trying to shim old methods and fields back in to Hadoop is the correct 
one.  The current Java-JNI interactions have been developed in an ad-hoc 
manner with no formal API definition and are explicitly Not-An-Interface 
and as a result no consideration has been given to cross-version 
stability. A compatibility shim approach is neither sustainable nor 
maintainable even on a single platform, and will severely compromise 
efforts to get Hadoop native components working on other platforms.

The approach suggested in HADOOP-11127 seems a much better way forward, 
in particular #2 (versioned libhadoop). As pointed out in the JIRA, #1 
(freeze libhadoop forever) is an obvious non-starter, and #3 (distribute 
libhadoop inside the JAR) is also a non-starter as it will not work 
cross-platform.

I'm happy to work on HADOOP-10027 and make that a prerequisite for 
fixing the Solaris DomainSocket issues discussed in this thread. I 
believe it's not practical to provide a fix for DomainSocket on Solaris 
with a 'No JNI signature changes' restriction.

Does that sound acceptable? If so I can draft up a proposal for native 
library version and platform naming, library search locations etc.

Thanks,

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 06/10/2015 17:03, Chris Nauroth wrote:

> Alan, would you please list the specific patches/JIRA issues that broke
> compatibility?  I have not been reviewing the native code lately, so it
> would help me catch up quickly if you already know which specific patches
> have introduced problems.  If those patches currently reside only on trunk
> and branch-2, then they have not yet shipped in an Apache release.  We'd
> still have an opportunity to fix them and avoid "dropping the match"
> before shipping 2.8.0.

https://issues.apache.org/jira/browse/HADOOP-11985 was the one I was 
thinking about as it changed fields from final to static. I haven't 
figured out what impact that has on the classes & shared object. Plus 
https://issues.apache.org/jira/browse/HADOOP-12184 which removed some 
fields.

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Steve Loughran <st...@hortonworks.com>.
>> 
>> 
>> On 10/6/15, 8:25 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>>> 
>>>>> In any case the constraint you are requesting would flat-out
>>>>> preclude this change, and would also mean that most of the other
>>>>> JNI changes that have been committed recently would have to be
>>>>> ripped out as well . In summary, the bridge is already burned.
>>>> 
>>>> We've covered the bridge in petrol but not quite dropped a match on
>>>> it.
>>> 
>>> No, I'm reasonably certain you've already dropped the match, and if you
>>> haven't it's just good fortune.
>>> 
>>> -- 
>>> Alan Burlison
>>> --
>>> 
>> 
>> 
> 

Ok, we just hadn't noticed the bridge was on fire...

Re: DomainSocket issues on Solaris

Posted by Chris Nauroth <cn...@hortonworks.com>.
I just spotted one: HADOOP-10027.  A field was removed from the Java
layer, which still could get referenced by an older version of the native
layer.  A backwards-compatible version of that patch would preserve the
old fields in the Java layer.

Full disclosure: I was the one who committed that patch, so this was a
miss by me during the code review.

--Chris Nauroth




On 10/6/15, 9:03 AM, "Chris Nauroth" <cn...@hortonworks.com> wrote:

>Alan, would you please list the specific patches/JIRA issues that broke
>compatibility?  I have not been reviewing the native code lately, so it
>would help me catch up quickly if you already know which specific patches
>have introduced problems.  If those patches currently reside only on trunk
>and branch-2, then they have not yet shipped in an Apache release.  We'd
>still have an opportunity to fix them and avoid "dropping the match"
>before shipping 2.8.0.
>
>Yes, we are aware that binary compatibility goes beyond the function
>signatures and into data layout and semantics.
>
>--Chris Nauroth
>
>
>
>
>On 10/6/15, 8:25 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>
>>On 06/10/2015 10:52, Steve Loughran wrote:
>>
>>>> That's not achievable as the method signatures need to change. Even
>>>> though they are private they need to change from static to normal
>>>> methods and the signatures need to change as well, as I said.
>>>
>>> We've done it before, simply by retaining the older method entry
>>> points. Moving from static to instance-specific is a bigger change.
>>> If the old entry points are there and retained, even if all uses have
>>> been ripped out of the hadoop code, then the new methods will get
>>> used. It's just that old stuff will still link.
>>
>>As I explained in my last email, converting the old static JNI functions
>>to be wrappers around new instance JNI functions requires a jobject
>>reference to be passed into the new function that the old one wraps
>>around. The static methods can't magic one up. An instance pointer *is*
>>available, the current code flow is Java object method -> static JNI
>>function so if we could change the JNI from static->instance then we'd
>>have what we needed. But if you are considering the JNI layer to be a
>>public interface (which I think is a big mistake, no matter how
>>convenient it might be), then you are simply screwed, both here and in
>>other places. As I've said, I have a suspicion that changes we've
>>already made have broken that compatibility anyway.
>>
>>>> JNI code is intimately  intertwined with the Java code it runs
>>>> with. Running mismatching Java & JNI versions is going to be a
>>>> recipe for eventual disaster as the JVM explicitly does *not* do
>>>> any error checking between Java and JNI.
>>>
>>> You mean jni code built for java7 isn't guaranteed to work on Java 8?
>>> If so, that's not something we knew of —and something to worry
>>> about.
>>
>>Actually I think that particular scenario is going to be OK. I wasn't
>>clear - sorry - what I was musing about was the fact that the Hadoop JNI
>>IO code delves into the innards of the platform Java classes and pulls
>>out bits of private data. That's explicitly not-an-interface and could
>>break at any time, although the likelihood may be low the JVM developers
>>could change it and you'd just be SOL. The same goes for all the other
>>private Java interfaces that Hadoop consumes - all the ones you get
>>warnings about when you build it. For example there are already plans to
>>make significant changes to sun.misc.Unsafe. That will
>>affect Hadoop.
>>
>>>> At some point some innocuous change will be made that will just
>>>> cause undefined behaviour.
>>>>
>>>> I don't actually know how you'd get a JAR/JNI mismatch as they are
>>>> built and packaged together, so I'm struggling to understand what
>>>> the potential issue is here.
>>>
>>> it arises whenever you try to deploy to YARN any application
>>> containing directly or indirectly (e.g. inside the spark-assembly
>>> JAR) the Hadoop java classes of a previous Hadoop version. libhadoop is
>>> on the PATH of the far end, your app uploads their hadoop JARs, and
>>> the moment something tries to use the JNI-backed method you get to
>>> see a stack trace.
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-11064
>>>
>>> if you look at the patch there, that's the kind of thing I'd like to
>>> see to address your solaris issues.
>>
>>Hmm, yes. That appears to be a short-term hack-around to keep things
>>running, not a fix. At very best, it's extremely fragile.
>>
>> From the bug:
>>
>>"We don't have any way of enforcing C API stability. Jenkins doesn't
>>check for it, most Java programmers don't know how to achieve it."
>>
>>In which case I think reading this will be helpful:
>>http://docs.oracle.com/cd/E19253-01/817-1984/chapter5-84101/index.html
>>
>>The assumption seems to be that as long as libhadoop.so keeps the same
>>list of functions with the same arguments then it will be
>>backwards-compatible. Unfortunately that's just flat out wrong. Binary
>>compatibility requires more than that. It also requires that there are
>>no changes to any data structures, and that the semantics of all the
>>functions remain completely unchanged. I'd put money on that not being
>>the case already. The errors you saw in HADOOP-11064 are the easy ones
>>because you got a run-time linker error. The others will cause
>>mysterious behaviour, memory corruption and general WTFness.
>>
>>>> In any case the constraint you are requesting would flat-out
>>>> preclude this change, and would also mean that most of the other
>>>> JNI changes that have been committed recently would have to be
>>>> ripped out as well . In summary, the bridge is already burned.
>>>
>>> We've covered the bridge in petrol but not quite dropped a match on
>>> it.
>>
>>No, I'm reasonably certain you've already dropped the match, and if you
>>haven't it's just good fortune.
>>
>>-- 
>>Alan Burlison
>>--
>>
>
>


Re: DomainSocket issues on Solaris

Posted by Chris Nauroth <cn...@hortonworks.com>.
Alan, would you please list the specific patches/JIRA issues that broke
compatibility?  I have not been reviewing the native code lately, so it
would help me catch up quickly if you already know which specific patches
have introduced problems.  If those patches currently reside only on trunk
and branch-2, then they have not yet shipped in an Apache release.  We'd
still have an opportunity to fix them and avoid "dropping the match"
before shipping 2.8.0.

Yes, we are aware that binary compatibility goes beyond the function
signatures and into data layout and semantics.

--Chris Nauroth




On 10/6/15, 8:25 AM, "Alan Burlison" <Al...@oracle.com> wrote:

>On 06/10/2015 10:52, Steve Loughran wrote:
>
>>> That's not achievable as the method signatures need to change. Even
>>> though they are private they need to change from static to normal
>>> methods and the signatures need to change as well, as I said.
>>
>> We've done it before, simply by retaining the older method entry
>> points. Moving from static to instance-specific is a bigger change.
>> If the old entry points are there and retained, even if all uses have
>> been ripped out of the hadoop code, then the new methods will get
>> used. It's just that old stuff will still link.
>
>As I explained in my last email, converting the old static JNI functions
>to be wrappers around new instance JNI functions requires a jobject
>reference to be passed into the new function that the old one wraps
>around. The static methods can't magic one up. An instance pointer *is*
>available, the current code flow is Java object method -> static JNI
>function so if we could change the JNI from static->instance then we'd
>have what we needed. But if you are considering the JNI layer to be a
>public interface (which I think is a big mistake, no matter how
>convenient it might be), then you are simply screwed, both here and in
>other places. As I've said, I have a suspicion that changes we've
>already made have broken that compatibility anyway.
>
>>> JNI code is intimately  intertwined with the Java code it runs
>>> with. Running mismatching Java & JNI versions is going to be a
>>> recipe for eventual disaster as the JVM explicitly does *not* do
>>> any error checking between Java and JNI.
>>
>> You mean jni code built for java7 isn't guaranteed to work on Java 8?
>> If so, that's not something we knew of —and something to worry
>> about.
>
>Actually I think that particular scenario is going to be OK. I wasn't
>clear - sorry - what I was musing about was the fact that the Hadoop JNI
>IO code delves into the innards of the platform Java classes and pulls
>out bits of private data. That's explicitly not-an-interface and could
>break at any time, although the likelihood may be low the JVM developers
>could change it and you'd just be SOL. The same goes for all the other
>private Java interfaces that Hadoop consumes - all the ones you get
>warnings about when you build it. For example there are already plans to
>make significant changes to sun.misc.Unsafe. That will
>affect Hadoop.
>
>>> At some point some innocuous change will be made that will just
>>> cause undefined behaviour.
>>>
>>> I don't actually know how you'd get a JAR/JNI mismatch as they are
>>> built and packaged together, so I'm struggling to understand what
>>> the potential issue is here.
>>
>> it arises whenever you try to deploy to YARN any application
>> containing directly or indirectly (e.g. inside the spark-assembly
>> JAR) the Hadoop java classes of a previous Hadoop version. libhadoop is
>> on the PATH of the far end, your app uploads their hadoop JARs, and
>> the moment something tries to use the JNI-backed method you get to
>> see a stack trace.
>>
>> https://issues.apache.org/jira/browse/HADOOP-11064
>>
>> if you look at the patch there, that's the kind of thing I'd like to
>> see to address your solaris issues.
>
>Hmm, yes. That appears to be a short-term hack-around to keep things
>running, not a fix. At very best, it's extremely fragile.
>
> From the bug:
>
>"We don't have any way of enforcing C API stability. Jenkins doesn't
>check for it, most Java programmers don't know how to achieve it."
>
>In which case I think reading this will be helpful:
>http://docs.oracle.com/cd/E19253-01/817-1984/chapter5-84101/index.html
>
>The assumption seems to be that as long as libhadoop.so keeps the same
>list of functions with the same arguments then it will be
>backwards-compatible. Unfortunately that's just flat out wrong. Binary
>compatibility requires more than that. It also requires that there are
>no changes to any data structures, and that the semantics of all the
>functions remain completely unchanged. I'd put money on that not being
>the case already. The errors you saw in HADOOP-11064 are the easy ones
>because you got a run-time linker error. The others will cause
>mysterious behaviour, memory corruption and general WTFness.
>
>>> In any case the constraint you are requesting would flat-out
>>> preclude this change, and would also mean that most of the other
>>> JNI changes that have been committed recently would have to be
>>> ripped out as well . In summary, the bridge is already burned.
>>
>> We've covered the bridge in petrol but not quite dropped a match on
>> it.
>
>No, I'm reasonably certain you've already dropped the match, and if you
>haven't it's just good fortune.
>
>-- 
>Alan Burlison
>--
>


Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 06/10/2015 10:52, Steve Loughran wrote:

>> That's not achievable as the method signatures need to change. Even
>> though they are private they need to change from static to normal
>> methods and the signatures need to change as well, as I said.
>
> We've done it before, simply by retaining the older method entry
> points. Moving from static to instance-specific is a bigger change.
> If the old entry points are there and retained, even if all uses have
> been ripped out of the hadoop code, then the new methods will get
> used. It's just that old stuff will still link.

As I explained in my last email, converting the old static JNI functions 
to be wrappers around new instance JNI functions requires a jobject 
reference to be passed into the new function that the old one wraps 
around. The static methods can't magic one up. An instance pointer *is* 
available, the current code flow is Java object method -> static JNI 
function so if we could change the JNI from static->instance then we'd 
have what we needed. But if you are considering the JNI layer to be a 
public interface (which I think is a big mistake, no matter how 
convenient it might be), then you are simply screwed, both here and in 
other places. As I've said, I have a suspicion that changes we've 
already made have broken that compatibility anyway.

>> JNI code is intimately  intertwined with the Java code it runs
>> with. Running mismatching Java & JNI versions is going to be a
>> recipe for eventual disaster as the JVM explicitly does *not* do
>> any error checking between Java and JNI.
>
> You mean jni code built for java7 isn't guaranteed to work on Java 8?
> If so, that's not something we knew of —and something to worry
> about.

Actually I think that particular scenario is going to be OK. I wasn't 
clear - sorry - what I was musing about was the fact that the Hadoop JNI 
IO code delves into the innards of the platform Java classes and pulls 
out bits of private data. That's explicitly not-an-interface and could 
break at any time, although the likelihood may be low the JVM developers 
could change it and you'd just be SOL. The same goes for all the other 
private Java interfaces that Hadoop consumes - all the ones you get 
warnings about when you build it. For example there are already plans to 
make significant changes to sun.misc.Unsafe. That will 
affect Hadoop.

>> At some point some innocuous change will be made that will just
>> cause undefined behaviour.
>>
>> I don't actually know how you'd get a JAR/JNI mismatch as they are
>> built and packaged together, so I'm struggling to understand what
>> the potential issue is here.
>
> it arises whenever you try to deploy to YARN any application
> containing directly or indirectly (e.g. inside the spark-assembly
> JAR) the Hadoop java classes of a previous Hadoop version. libhadoop is
> on the PATH of the far end, your app uploads their hadoop JARs, and
> the moment something tries to use the JNI-backed method you get to
> see a stack trace.
>
> https://issues.apache.org/jira/browse/HADOOP-11064
>
> if you look at the patch there, that's the kind of thing I'd like to
> see to address your solaris issues.

Hmm, yes. That appears to be a short-term hack-around to keep things 
running, not a fix. At very best, it's extremely fragile.

From the bug:

"We don't have any way of enforcing C API stability. Jenkins doesn't 
check for it, most Java programmers don't know how to achieve it."

In which case I think reading this will be helpful: 
http://docs.oracle.com/cd/E19253-01/817-1984/chapter5-84101/index.html

The assumption seems to be that as long as libhadoop.so keeps the same 
list of functions with the same arguments then it will be 
backwards-compatible. Unfortunately that's just flat out wrong. Binary 
compatibility requires more than that. It also requires that there are 
no changes to any data structures, and that the semantics of all the 
functions remain completely unchanged. I'd put money on that not being 
the case already. The errors you saw in HADOOP-11064 are the easy ones 
because you got a run-time linker error. The others will cause 
mysterious behaviour, memory corruption and general WTFness.
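For reference, the kind of symbol versioning the linked linker guide describes could be sketched as a Solaris link-editor mapfile (the version name and exported symbols here are purely illustrative); a consumer built against a mismatched version then fails at load time with a clear error rather than corrupting memory:

```
$mapfile_version 2
SYMBOL_VERSION LIBHADOOP_3.0 {
    global:
        Java_org_apache_hadoop_net_unix_DomainSocket_accept0;
        Java_org_apache_hadoop_net_unix_DomainSocket_connect0;
    local:
        *;
};
```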

>> In any case the constraint you are requesting would flat-out
>> preclude this change, and would also mean that most of the other
>> JNI changes that have been committed recently would have to be
>> ripped out as well . In summary, the bridge is already burned.
>
> We've covered the bridge in petrol but not quite dropped a match on
> it.

No, I'm reasonably certain you've already dropped the match, and if you 
haven't it's just good fortune.

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Steve Loughran <st...@hortonworks.com>.
> On 5 Oct 2015, at 15:56, Alan Burlison <Al...@oracle.com> wrote:
> 
> On 05/10/2015 15:14, Steve Loughran wrote:
> 
>> I don't think anyone would object for the changes, except for one big
>> caveat: a lot of us would like that binary file to be backwards
>> compatible; a Hadoop 2.6 JAR should be able to link to the 2.8+
>> libhadoop. So whatever gets changed, the old methods are still going
>> to hang around
> 
> That's not achievable as the method signatures need to change. Even though they are private they need to change from static to normal methods and the signatures need to change as well, as I said.

We've done it before, simply by retaining the older method entry points. Moving from static to instance-specific is a bigger change. If the old entry points are there and retained, even if all uses have been ripped out of the hadoop code, then the new methods will get used. It's just that old stuff will still link.  

> 
> JNI code is intimately  intertwined with the Java code it runs with. Running mismatching Java & JNI versions is going to be a recipe for eventual disaster as the JVM explicitly does *not* do any error checking between Java and JNI.

You mean jni code built for java7 isn't guaranteed to work on Java 8? If so, that's not something we knew of —and something to worry about.


> At some point some innocuous change will be made that will just cause undefined behaviour.
> 
> I don't actually know how you'd get a JAR/JNI mismatch as they are built and packaged together, so I'm struggling to understand what the potential issue is here.

it arises whenever you try to deploy to YARN any application containing directly or indirectly (e.g. inside the spark-assembly JAR) the Hadoop java classes of a previous Hadoop version. libhadoop is on the PATH of the far end, your app uploads their hadoop JARs, and the moment something tries to use the JNI-backed method you get to see a stack trace.

https://issues.apache.org/jira/browse/HADOOP-11064

if you look at the patch there, that's the kind of thing I'd like to see to address your solaris issues.

> 
> In any case the constraint you are requesting would flat-out preclude this change, and would also mean that most of the other JNI changes that have been committed recently would have to be ripped out as well . In summary, the bridge is already burned.
> 

We've covered the bridge in petrol but not quite dropped a match on it.

HADOOP-11127, "Improve versioning and compatibility support in native library for downstream hadoop-common users." says "we need to do better here", which is probably some way of packaging native libs.

Now, if you look at our compatibility statement, we don't say anything about native binary linking:
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html

We have managed to avoid addressing this issue to date: the HADOOP-11064 problem was caught before 2.6 shipped, and the patch was put in without setting an immutable guarantee of compatibility going forward. We just don't want to light that bridge when a lot of users are on the other side of it.

-Steve

Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 06/10/2015 11:01, Steve Loughran wrote:

>> I really don't want to do that as it relegates Solaris to only ever
>> being a second-class citizen.
>
> I know that Solaris matters to you 100%, and we've tried to be as
> supportive as we can, even though it's not viewed as important to
> anyone else. We don't want to make it 2nd class, just want to get it
> to be 1st class in a way which doesn't create lots of compatibility
> problems.

Yes you have been supportive, I recognise that and I'm grateful for it 
:-) Although I'm the main Solaris person that's visible, I'm not the 
only one who is interested. And I fully get the backwards compatibility 
thing; it's one of the main features of Solaris. However, keeping 
backwards binary compatibility is something you really have to decide up 
front and design for; it's very difficult to add it as a constraint 
after the fact, as this scenario illustrates. And without internal or 
external library versioning support, it's even harder still.

> Is the per-socket timeout assumption used anywhere outside the
> tests?

I've no real idea yet, as I haven't got to the point where I have a 
'Full Fat JNI' version of Hadoop on Solaris. I do know that around 50% 
of the ~200 test failures I'm seeing are most likely related to timeout 
handling, which is why I'm concentrating on it.

> so we move from
>
> function(fileHandle)
>
> to function(Object), where object->fileHandle and object->timeout are both there?

To be precise, the signature change I have at the moment is (for example)

JNIEXPORT jint JNICALL
Java_org_apache_hadoop_net_unix_DomainSocket_accept0(
JNIEnv *env, jclass clazz, jint fd)

to

JNIEXPORT jint JNICALL
Java_org_apache_hadoop_net_unix_DomainSocket_accept0(
JNIEnv *env, jobject obj)

filehandle, readTimeout and writeTimeout are then accessed as members of 
the jobject.

> what about
>
> function(fileHandle, timeout)
>
> where we retain
>
> function(fileHandle) { return function(fileHandle, defaultTimeout)}?
>
> And then never invoke it in our existing code, which now calls the new operation?
> or if there's a call
>
> setTimeout(fileHandle, timeout)
>
> which for linux sets the socket timeout —and in solaris updates some
> map handle->timeout used in the select() call.

Yes, I'd thought of that. The problem is the 'some map' bit. Maintaining 
that map would be clunky: file descriptor IDs are not going to be 
sequential and are reused, so we'd have to store them in some sort of 
shadow data structure and track each and every close, and that's fiddly.

And the 'default timeout' option is, I believe, a non-starter: the 
default timeout is 2 minutes, and many of the tests set it to a much 
shorter interval and expect it to time out at the specified time.

The problem is that if we store the timeout alongside the filehandle, we 
need access to an object pointer to retrieve it during the socket call. 
As the existing functions are static ones, an object pointer isn't available.

I've looked long and hard at this, and I have not come up with a 
mechanism that is both backwards binary compatible and not totally vile.

>> The other option is to effectively write a complete Solaris-only
>> replacement for DomainSocket, whether switching between that and the
>> current one is done at compile or run-time isn't really the point.
>> There's a fairly even split between the Java & JNI components of
>> DomainSocket, so whichever way it's done there will be significant
>> duplication of the overall logic and most likely code duplication.
>> That means that bug fixes in one place have to be exactly mirrored in
>> another, and that's unlikely to be sustainable.
>
> It's not going to be maintained, or more precisely: it'll be broken
> on a regular basis and you are the one left to handle it.

Exactly, which is why it is a non-starter. Whatever I do to fix this 
needs to be as minimal as possible and needs to disappear on platforms 
which don't need it.

>> Unfortunately I can't predict when that might happen by, though. In
>> my prototype it probes for working timeouts at configure time, so
>> when they do become available they'll be used automatically.
>
> I agree that there is no formal libhadoop.so compatibility policy and
> that is frustrating.  This has been an issue for those who want to run
> jars compiled against multiple different versions of hadoop through
> the same YARN instance.  We've discussed it in the past, but never
> really come up with a great solution.  The best approach really would
> be to bundle libhadoop.so inside the hadoop jar files, so that it
> could be integral to the Hadoop version itself.  However, nobody has
> done the work to make that happen.  The second-best approach would be
> to include the Hadoop version in the libhadoop name itself (so we'd
> have libhadoop28.so for hadoop 2.8, and so forth.)  Anyway, I think we
> can solve this particular issue without going down that rathole...

Unfortunately I don't think we can, not without further complicating the 
existing complicated code with a lot of scaffolding.

I don't understand how YARN & multiple Hadoop versions interact, but if 
they are all in the same JVM instance then no amount of fiddling with 
shared objects will help as you can't have multiple SOs providing the 
same APIs within the same process - or at least not without a lot of 
complicated, fragile and utterly platform-specific configuration and code.

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Steve Loughran <st...@hortonworks.com>.
On 6 Oct 2015, at 00:34, Alan Burlison <Al...@oracle.com> wrote:

On 05/10/15 18:30, Colin P. McCabe wrote:

1. Don't get DomainSocket working on Solaris.  Rely on the legacy
short-circuit read instead.  It has poorer security guarantees, but
doesn't require domain sockets.  You can add a line of code to the
failing junit tests to skip them on Solaris.

I really don't want to do that as it relegates Solaris to only ever being a second-class citizen.

I know that Solaris matters to you 100%, and we've tried to be as supportive as we can, even though it's not viewed as important to anyone else. We don't want to make it 2nd class, just want to get it to be 1st class in a way which doesn't create lots of compatibility problems.


2. Use a separate "timer wheel" thread which implements coarse-grained
timeouts by calling shutdown() on domain sockets that have been active
for too long.  This thread could be global (one per JVM).

From what I can tell that won't stop all the test failures as they are written with the assumption that per-socket timeouts are available and that they time out exactly when expected.


Is the per-socket timeout assumption used anywhere outside the tests?

3. Implement the poll/select loop you discussed earlier.  As Steve
commented, it would be easier to do this by adding new functions,
rather than by changing existing ones.  I don't think "ifdef skid
marks" are necessary since poll and select are supported on Linux and
so forth as well as Solaris.  You would just need some code in
DomainSocket.java to select the appropriate implementation at runtime
based on the OS.

I could switch the implementation over to use poll everywhere but I haven't done that - Linux still uses socket timeouts. The issue is that in order to make poll() work I need to maintain the read/write timeouts alongside the filehandle - I can't store the timeout 'inside' the filehandle using setsockopt(). That means that the filehandle and the timeouts have to be stored together somewhere. The logical place to put the timeouts is in the same DomainSocket instance that holds the filehandle.

If the DomainSocket JNI methods were all instance methods then there wouldn't be a problem, but they aren't: they are static methods where the integer filehandle is passed in as a parameter. And it wouldn't work if I changed the native method parameter lists to include the timeouts, as they need to be read/write. The only non-vile way I can come up with of doing this is to convert the JNI methods from static into instance methods. Even if that's the only change I make and I still pass in the filehandle as a parameter, the signatures will have changed, as the 2nd parameter would now be an object reference and not a class reference.

so we move from

function(fileHandle)

to function(Object), where object->fileHandle and object->timeout are both there?

what about

function(fileHandle, timeout)

where we retain

function(fileHandle) { return function(fileHandle, defaultTimeout)}?

And then never invoke it in our existing code, which now calls the new operation?

or if there's a call

setTimeout(fileHandle, timeout)

which for linux sets the socket timeout —and in solaris updates some map handle->timeout used in the select() call.


The other option is to effectively write a complete Solaris-only replacement for DomainSocket, whether switching between that and the current one is done at compile or run-time isn't really the point. There's a fairly even split between the Java & JNI components of DomainSocket, so whichever way it's done there will be significant duplication of the overall logic and most likely code duplication. That means that bug fixes in one place have to be exactly mirrored in another, and that's unlikely to be sustainable.


It's not going to be maintained, or more precisely: it'll be broken on a regular basis and you are the one left to handle it.


My goal has been to keep the current logic as unchanged as possible. My prototype does that by literally prefixing each libc socket operation with a poll() call to check the filehandle is ready. The rest of the logic in DomainSocket is completely unchanged. That means that the behaviour between Linux and Solaris should be as identical as is possible.

Since you commented that Solaris is implementing timeout support in
the future, approaches #1 or #2 could be placeholders until that's
finished.

Unfortunately I can't predict when that might happen, though. In my prototype it probes for working timeouts at configure time, so when they do become available they'll be used automatically.

I agree that there is no formal libhadoop.so compatibility policy and
that is frustrating.  This has been an issue for those who want to run
jars compiled against multiple different versions of hadoop through
the same YARN instance.  We've discussed it in the past, but never
really come up with a great solution.  The best approach really would
be to bundle libhadoop.so inside the hadoop jar files, so that it
could be integral to the Hadoop version itself.  However, nobody has
done the work to make that happen.  The second-best approach would be
to include the Hadoop version in the libhadoop name itself (so we'd
have libhadoop28.so for hadoop 2.8, and so forth.)  Anyway, I think we
can solve this particular issue without going down that rathole...

As I said, I believe that ship has long since sailed. Changes that have already been let in have, I believe, broken the backwards binary compatibility of the Java/JNI interface. Broken is broken; arguing that this proposal shouldn't be allowed in because it simply adds more brokenness to the existing brokenness is really missing the point. As far as I can tell, there already is no backwards compatibility.

--
Alan Burlison
--



Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 05/10/15 18:30, Colin P. McCabe wrote:

> 1. Don't get DomainSocket working on Solaris.  Rely on the legacy
> short-circuit read instead.  It has poorer security guarantees, but
> doesn't require domain sockets.  You can add a line of code to the
> failing junit tests to skip them on Solaris.

I really don't want to do that as it relegates Solaris to only ever 
being a second-class citizen.

> 2. Use a separate "timer wheel" thread which implements coarse-grained
> timeouts by calling shutdown() on domain sockets that have been active
> for too long.  This thread could be global (one per JVM).

From what I can tell that won't stop all the test failures as they are 
written with the assumption that per-socket timeouts are available and 
that they time out exactly when expected.

> 3. Implement the poll/select loop you discussed earlier.  As Steve
> commented, it would be easier to do this by adding new functions,
> rather than by changing existing ones.  I don't think "ifdef skid
> marks" are necessary since poll and select are supported on Linux and
> so forth as well as Solaris.  You would just need some code in
> DomainSocket.java to select the appropriate implementation at runtime
> based on the OS.

I could switch the implementation over to use poll everywhere but I 
haven't done that - Linux still uses socket timeouts. The issue is that 
in order to make poll() work I need to maintain the read/write timeouts 
alongside the filehandle - I can't store the timeout 'inside' the 
filehandle using setsockopt(). That means that the filehandle and the 
timeouts have to be stored together somewhere. The logical place to put 
the timeouts is in the same DomainSocket instance that holds the 
filehandle. If the DomainSocket JNI methods were all instance methods 
then there wouldn't be a problem, but they aren't: they are static 
methods where the integer filehandle is passed in as a parameter. And it 
wouldn't work if I changed the native method parameter lists to include 
the timeouts, as they need to be read/write. The only non-vile way I can 
come up with of doing this is to convert the JNI methods from static 
into instance methods. Even if that's the only change I make and I still 
pass in the filehandle as a parameter, the signatures will have changed, 
as the 2nd parameter would now be an object reference and not a class 
reference.

The other option is to effectively write a complete Solaris-only 
replacement for DomainSocket, whether switching between that and the 
current one is done at compile or run-time isn't really the point. 
There's a fairly even split between the Java & JNI components of 
DomainSocket, so whichever way it's done there will be significant 
duplication of the overall logic and most likely code duplication. That 
means that bug fixes in one place have to be exactly mirrored in 
another, and that's unlikely to be sustainable.

My goal has been to keep the current logic as unchanged as possible. My 
prototype does that by literally prefixing each libc socket operation 
with a poll() call to check the filehandle is ready. The rest of the 
logic in DomainSocket is completely unchanged. That means that the 
behaviour between Linux and Solaris should be as identical as possible.

> Since you commented that Solaris is implementing timeout support in
> the future, approaches #1 or #2 could be placeholders until that's
> finished.

Unfortunately I can't predict when that might happen, though. In my 
prototype it probes for working timeouts at configure time, so when they 
do become available they'll be used automatically.

> I agree that there is no formal libhadoop.so compatibility policy and
> that is frustrating.  This has been an issue for those who want to run
> jars compiled against multiple different versions of hadoop through
> the same YARN instance.  We've discussed it in the past, but never
> really come up with a great solution.  The best approach really would
> be to bundle libhadoop.so inside the hadoop jar files, so that it
> could be integral to the Hadoop version itself.  However, nobody has
> done the work to make that happen.  The second-best approach would be
> to include the Hadoop version in the libhadoop name itself (so we'd
> have libhadoop28.so for hadoop 2.8, and so forth.)  Anyway, I think we
> can solve this particular issue without going down that rathole...

As I said, I believe that ship has long since sailed. Changes that have 
already been let in have, I believe, broken the backwards binary 
compatibility of the Java/JNI interface. Broken is broken; arguing that 
this proposal shouldn't be allowed in because it simply adds more 
brokenness to the existing brokenness is really missing the point. As 
far as I can tell, there already is no backwards compatibility.

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by "Colin P. McCabe" <cm...@apache.org>.
Hi Alan,

As Chris commented earlier, the main use of DomainSocket is to
transfer file descriptors from the DataNode to the DFSClient.  As you
know, this is something that can only be done through domain sockets,
not through inet sockets.  We do support passing data over domain
sockets, but in practice we rarely turn it on since we haven't seen a
performance advantage.

As I see it, you have a few different options here for getting this
working on Solaris.

1. Don't get DomainSocket working on Solaris.  Rely on the legacy
short-circuit read instead.  It has poorer security guarantees, but
doesn't require domain sockets.  You can add a line of code to the
failing junit tests to skip them on Solaris.

2. Use a separate "timer wheel" thread which implements coarse-grained
timeouts by calling shutdown() on domain sockets that have been active
for too long.  This thread could be global (one per JVM).

3. Implement the poll/select loop you discussed earlier.  As Steve
commented, it would be easier to do this by adding new functions,
rather than by changing existing ones.  I don't think "ifdef skid
marks" are necessary since poll and select are supported on Linux and
so forth as well as Solaris.  You would just need some code in
DomainSocket.java to select the appropriate implementation at runtime
based on the OS.

Since you commented that Solaris is implementing timeout support in
the future, approaches #1 or #2 could be placeholders until that's
finished.

I agree that there is no formal libhadoop.so compatibility policy and
that is frustrating.  This has been an issue for those who want to run
jars compiled against multiple different versions of hadoop through
the same YARN instance.  We've discussed it in the past, but never
really come up with a great solution.  The best approach really would
be to bundle libhadoop.so inside the hadoop jar files, so that it
could be integral to the Hadoop version itself.  However, nobody has
done the work to make that happen.  The second-best approach would be
to include the Hadoop version in the libhadoop name itself (so we'd
have libhadoop28.so for hadoop 2.8, and so forth.)  Anyway, I think we
can solve this particular issue without going down that rathole...

best,
Colin

On Mon, Oct 5, 2015 at 7:56 AM, Alan Burlison <Al...@oracle.com> wrote:
> On 05/10/2015 15:14, Steve Loughran wrote:
>
>> I don't think anyone would object for the changes, except for one big
>> caveat: a lot of us would like that binary file to be backwards
>> compatible; a Hadoop 2.6 JAR should be able to link to the 2.8+
>> libhadoop. So whatever gets changed, the old methods are still going
>> to hang around
>
>
> That's not achievable, as the method signatures need to change. Even though
> they are private, they need to change from static to instance methods, and
> their signatures change as a result, as I said.
>
> JNI code is intimately  intertwined with the Java code it runs with. Running
> mismatching Java & JNI versions is going to be a recipe for eventual
> disaster as the JVM explicitly does *not* do any error checking between Java
> and JNI. At some point some innocuous change will be made that will just
> cause undefined behaviour.
>
> I don't actually know how you'd get a JAR/JNI mismatch as they are built and
> packaged together, so I'm struggling to understand what the potential issue
> is here.
>
> In any case the constraint you are requesting would flat-out preclude this
> change, and would also mean that most of the other JNI changes that have
> been committed recently would have to be ripped out as well. In summary,
> the bridge is already burned.
>
> --
> Alan Burlison
> --

Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 05/10/2015 15:14, Steve Loughran wrote:

> I don't think anyone would object for the changes, except for one big
> caveat: a lot of us would like that binary file to be backwards
> compatible; a Hadoop 2.6 JAR should be able to link to the 2.8+
> libhadoop. So whatever gets changed, the old methods are still going
> to hang around

That's not achievable, as the method signatures need to change. Even 
though they are private, they need to change from static to instance 
methods, and their signatures change as a result, as I said.

JNI code is intimately intertwined with the Java code it runs with. 
Running mismatching Java & JNI versions is going to be a recipe for 
eventual disaster as the JVM explicitly does *not* do any error checking 
between Java and JNI. At some point some innocuous change will be made 
that will just cause undefined behaviour.

I don't actually know how you'd get a JAR/JNI mismatch as they are built 
and packaged together, so I'm struggling to understand what the 
potential issue is here.

In any case the constraint you are requesting would flat-out preclude 
this change, and would also mean that most of the other JNI changes that 
have been committed recently would have to be ripped out as well. In 
summary, the bridge is already burned.

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Steve Loughran <st...@hortonworks.com>.
I don't think anyone would object for the changes, except for one big caveat: a lot of us would like that binary file to be backwards compatible; a Hadoop 2.6 JAR should be able to link to the 2.8+ libhadoop. So whatever gets changed, the old methods are still going to hang around

> On 2 Oct 2015, at 17:46, Alan Burlison <Al...@oracle.com> wrote:
> 
> On 30/09/2015 09:14, Alan Burlison wrote:
> 
>> The basic idea is to add two new fields to DomainSocket.c to hold the
>> read/write timeouts. On platforms that support SO_SNDTIMEO and
>> SO_RCVTIMEO these would be unused as setsockopt() would be used to set
>> the socket timeouts. On platforms such as Solaris the JNI code would use
>> the values to implement the timeouts appropriately.
> 
> Unfortunately it's not as simple as I'd hoped. For some reason I don't really understand, nearly all the JNI methods are declared as static and therefore don't get a "this" pointer; as a consequence, all the class data members that are needed by the JNI code have to be passed in as parameters. That also means it's not possible to store the timeouts in the DomainSocket fields from within the JNI code. Most of the JNI methods should be instance methods rather than static ones, but making that change would require some significant surgery to DomainSocket.
> 
> -- 
> Alan Burlison
> --
> 


Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 30/09/2015 09:14, Alan Burlison wrote:

> The basic idea is to add two new fields to DomainSocket.c to hold the
> read/write timeouts. On platforms that support SO_SNDTIMEO and
> SO_RCVTIMEO these would be unused as setsockopt() would be used to set
> the socket timeouts. On platforms such as Solaris the JNI code would use
> the values to implement the timeouts appropriately.

Unfortunately it's not as simple as I'd hoped. For some reason I don't 
really understand, nearly all the JNI methods are declared as static and 
therefore don't get a "this" pointer; as a consequence, all the class 
data members that are needed by the JNI code have to be passed in as 
parameters. That also means it's not possible to store the timeouts in 
the DomainSocket fields from within the JNI code. Most of the JNI 
methods should be instance methods rather than static ones, but making 
that change would require some significant surgery to DomainSocket.

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 30/09/2015 17:23, Chris Nauroth wrote:

> I think file descriptor sharing is a capability of Unix domain
> sockets only, and not INET sockets.

Yes, that's correct.

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Chris Nauroth <cn...@hortonworks.com>.
That's an interesting find, though I don't think we'd be able to swap in
INET sockets in this part of the code.  We use Unix domain sockets to
share an open file descriptor from the DataNode process to the HDFS client
process, and then the client reads directly from that open file
descriptor.  I think file descriptor sharing is a capability of Unix
domain sockets only, and not INET sockets.  As you said, I wouldn't expect
throughput on the Unix domain socket to be a bottleneck, because there is
very little data transferred.

--Chris Nauroth




On 9/30/15, 9:12 AM, "Alan Burlison" <Al...@oracle.com> wrote:

>On 30/09/2015 16:56, Chris Nauroth wrote:
>
>> Alan, I also meant to say that I didn't understand the comment about "in
>> production it seems that DomainSocket is less commonly used".  The
>>current
>> implementation of short-circuit read definitely utilizes DomainSocket,
>>and
>> it's very common to enable this in production clusters.  The
>>documentation
>> page you mentioned includes discussion of a legacy short-circuit read
>> implementation, which did not utilize UNIX domain sockets, but the
>>legacy
>> implementation is rarely used in practice now.
>
>Oh, OK - thanks for the clarification. I couldn't find much about
>DomainSocket other than the link I posted and that didn't make it sound
>like it was used all that much. I'll make sure the JIRA reflects what
>you said above.
>
>Interestingly, INET sockets are faster than UNIX sockets on Linux as
>well as on Solaris. There's not much in it, around 10% in both cases,
>and I suspect socket throughput isn't the rate-limiting step anyway.
>
>-- 
>Alan Burlison
>--
>


Re: DomainSocket issues on Solaris

Posted by Alan Burlison <Al...@oracle.com>.
On 30/09/2015 16:56, Chris Nauroth wrote:

> Alan, I also meant to say that I didn't understand the comment about "in
> production it seems that DomainSocket is less commonly used".  The current
> implementation of short-circuit read definitely utilizes DomainSocket, and
> it's very common to enable this in production clusters.  The documentation
> page you mentioned includes discussion of a legacy short-circuit read
> implementation, which did not utilize UNIX domain sockets, but the legacy
> implementation is rarely used in practice now.

Oh, OK - thanks for the clarification. I couldn't find much about 
DomainSocket other than the link I posted and that didn't make it sound 
like it was used all that much. I'll make sure the JIRA reflects what 
you said above.

Interestingly, INET sockets are faster than UNIX sockets on Linux as 
well as on Solaris. There's not much in it, around 10% in both cases, 
and I suspect socket throughput isn't the rate-limiting step anyway.

-- 
Alan Burlison
--

Re: DomainSocket issues on Solaris

Posted by Chris Nauroth <cn...@hortonworks.com>.
Alan, I also meant to say that I didn't understand the comment about "in
production it seems that DomainSocket is less commonly used".  The current
implementation of short-circuit read definitely utilizes DomainSocket, and
it's very common to enable this in production clusters.  The documentation
page you mentioned includes discussion of a legacy short-circuit read
implementation, which did not utilize UNIX domain sockets, but the legacy
implementation is rarely used in practice now.

--Chris Nauroth




On 9/30/15, 8:46 AM, "Chris Nauroth" <cn...@hortonworks.com> wrote:

>Hello Alan,
>
>I think this sounds like a reasonable approach.  I recommend that you file
>a JIRA with the proposal (copy-paste the content of your email into a
>comment) and then wait a few days before starting work in earnest to see
>if anyone else wants to discuss it first.  I also recommend notifying
>Colin Patrick McCabe on that JIRA.  It would be good to get a second
>opinion from him, since he is the original author of much of this code.
>
>--Chris Nauroth
>


Re: DomainSocket issues on Solaris

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Alan,

I think this sounds like a reasonable approach.  I recommend that you file
a JIRA with the proposal (copy-paste the content of your email into a
comment) and then wait a few days before starting work in earnest to see
if anyone else wants to discuss it first.  I also recommend notifying
Colin Patrick McCabe on that JIRA.  It would be good to get a second
opinion from him, since he is the original author of much of this code.

--Chris Nauroth




On 9/30/15, 1:14 AM, "Alan Burlison" <Al...@oracle.com> wrote:

>Now that the Hadoop native code builds on Solaris I've been chipping
>away at all the test failures. About 50% of the failures involve
>DomainSocket, either directly or indirectly. That seems to be mainly
>because the tests use DomainSocket to do single-node testing, whereas in
>production it seems that DomainSocket is less commonly used
>(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html).
>
>The particular problem on Solaris is that socket read/write timeouts
>(the SO_SNDTIMEO and SO_RCVTIMEO socket options) are not supported for
>UNIX domain (PF_UNIX) sockets. Those options are however supported for
>PF_INET sockets. That's because the socket implementation on Solaris is
>split roughly into two parts, for inet sockets and for STREAMS sockets,
>and the STREAMS implementation lacks support for SO_SNDTIMEO and
>SO_RCVTIMEO. As an aside, performance of sockets that use loopback or
>the host's own IP is slightly better than that of UNIX domain sockets on
>Solaris.
>
>I'm investigating getting timeouts supported for PF_UNIX sockets added
>to Solaris, but in the meantime I'm also looking how this might be
>worked around in Hadoop. One way would be to implement timeouts by
>wrapping all the read/write/send/recv etc calls in DomainSocket.c with
>either poll() or select().
>
>The basic idea is to add two new fields to DomainSocket.c to hold the
>read/write timeouts. On platforms that support SO_SNDTIMEO and
>SO_RCVTIMEO these would be unused as setsockopt() would be used to set
>the socket timeouts. On platforms such as Solaris the JNI code would use
>the values to implement the timeouts appropriately.
>
>To prevent the code in DomainSocket.c becoming a #ifdef hairball, the
>current socket IO function calls such as accept(), send(), read() etc
>would be replaced with a macros such as HD_ACCEPT. On platforms that
>provide timeouts these would just expand to the normal socket functions,
>on platforms that don't support timeouts it would expand to wrappers
>that implements timeouts for them.
>
>The only caveats are that all code that does anything to a PF_UNIX
>socket would *always* have to do so via DomainSocket. As far as I can
>tell that's not an issue, but it would have to be borne in mind if any
>changes were made in this area.
>
>Before I set about doing this, does the approach seem reasonable?
>
>Thanks,
>
>-- 
>Alan Burlison
>--
>