Posted to common-dev@hadoop.apache.org by Alan Burlison <Al...@oracle.com> on 2015/09/30 10:14:46 UTC
DomainSocket issues on Solaris
Now that the Hadoop native code builds on Solaris I've been chipping
away at all the test failures. About 50% of the failures involve
DomainSocket, either directly or indirectly. That seems to be mainly
because the tests use DomainSocket to do single-node testing, whereas in
production it seems that DomainSocket is less commonly used
(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html).
The particular problem on Solaris is that socket read/write timeouts
(the SO_SNDTIMEO and SO_RCVTIMEO socket options) are not supported for
UNIX domain (PF_UNIX) sockets. Those options are however supported for
PF_INET sockets. That's because the socket implementation on Solaris is
split roughly into two parts, for inet sockets and for STREAMS sockets,
and the STREAMS implementation lacks support for SO_SNDTIMEO and
SO_RCVTIMEO. As an aside, performance of sockets that use loopback or
the host's own IP is slightly better than that of UNIX domain sockets on
Solaris.
I'm investigating getting timeouts supported for PF_UNIX sockets added
to Solaris, but in the meantime I'm also looking at how this might be
worked around in Hadoop. One way would be to implement timeouts by
wrapping all the read/write/send/recv etc calls in DomainSocket.c with
either poll() or select().
The basic idea is to add two new fields to DomainSocket.c to hold the
read/write timeouts. On platforms that support SO_SNDTIMEO and
SO_RCVTIMEO these would be unused as setsockopt() would be used to set
the socket timeouts. On platforms such as Solaris the JNI code would use
the values to implement the timeouts appropriately.
To prevent the code in DomainSocket.c becoming a #ifdef hairball, the
current socket IO function calls such as accept(), send(), read() etc
would be replaced with macros such as HD_ACCEPT. On platforms that
provide timeouts these would just expand to the normal socket functions;
on platforms that don't support timeouts they would expand to wrappers
that implement timeouts for them.
The only caveat is that all code that does anything to a PF_UNIX
socket would *always* have to do so via DomainSocket. As far as I can
tell that's not an issue, but it would have to be borne in mind if any
changes were made in this area.
Before I set about doing this, does the approach seem reasonable?
Thanks,
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Chris Nauroth <cn...@hortonworks.com>.
Alan, thank you for picking up HADOOP-11127. I think it has needed a
strong use case to kick it back into action, and maybe Solaris support is
that use case. I'll join the discussion on the JIRA.
--Chris Nauroth
On 10/8/15, 9:40 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>On 07/10/2015 22:05, Alan Burlison wrote:
>
>> I'll draft up a proposal and attach it to HADOOP-11127.
>
>Attached to HADOOP-11127 as proposal.txt
>
>--
>Alan Burlison
>--
>
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 07/10/2015 22:05, Alan Burlison wrote:
> I'll draft up a proposal and attach it to HADOOP-11127.
Attached to HADOOP-11127 as proposal.txt
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 07/10/15 18:53, Colin P. McCabe wrote:
> I think you could come up with a select/poll solution while using the
> old function signatures. A 4-byte int is more than enough information
> to pass in, given that you can use it as an index into a table in the
> C code.
I have thought about that but a simple table would not work very well.
It would have to be potentially quite large and would be sparsely
populated. It would really have to be some sort of map and would most
likely have to be implemented in C. However it is done, it becomes a
Solaris-only maintenance burden. Yes it's possible, but it seemed
distinctly undesirable.
> There are also a lot of other solutions to this problem, like
> I pointed out earlier. For example, you dismissed the timer wheel
> suggestion because of a detail of a unit test, but we could easily
> change the test.
Unfortunately there are somewhere around 100 test failures that I think
are related to the socket timeout issue, which is why I focussed on it.
> Anyway, changing the function signatures in the way you described is
> certainly reasonable and I wouldn't object to it. It is probably the
> most natural solution.
That's the conclusion I came to, but I fully understand there has to be
a solution for the Java/JNI versioning issue as well.
>> Does that sound acceptable? If so I can draft up a proposal for native
>> library version and platform naming, library search locations etc.
>
> Yes, I think it would be good to make some progress on HADOOP-11127.
> We have been putting off the issue for too long.
Even if I put together a solution for DomainSocket that doesn't need
changes to the JNI interface I'm almost certain that subsequent work
will hit the same issue. I'd rather spend the time up front and come up
with a once-and-for-all solution; I think overall that will work out to
be less effort and certainly less risky.
I'll draft up a proposal and attach it to HADOOP-11127.
Thanks,
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by "Colin P. McCabe" <cm...@apache.org>.
On Wed, Oct 7, 2015 at 9:35 AM, Alan Burlison <Al...@oracle.com> wrote:
> On 06/10/2015 10:52, Steve Loughran wrote:
>
>> HADOOP-11127, "Improve versioning and compatibility support in native
>> library for downstream hadoop-common users." says "we need to do
>> better here", which is probably some way of packaging native libs.
>
>
> From that JIRA:
>
>> Colin Patrick McCabe added a comment - 18/Apr/15 00:48
>>
>> I was thinking we:
>> 1. Add the Hadoop release version to libhadoop.so. It's very, very
>> simple and solves a lot of problems here.
>> 2. Remove libhadoop.so and libhdfs.so from the release tarball, since
>> they are CPU and OS-specific and the tarballs are not
>> 3. Schedule some follow-on work to include the native libraries
>> inside jars, as Chris suggested. This will take longer but ultimately
>> be the best solution.
>
>
> And:
>
>> I just spotted one: HADOOP-10027. A field was removed from the Java
>> layer, which still could get referenced by an older version of the native
>> layer. A backwards-compatible version of that patch would preserve the
>> old fields in the Java layer.
>
>
> I've been thinking about this and I really don't think the strategy of
> trying to shim old methods and fields back into Hadoop is the correct one.
> The current Java-JNI interactions have been developed in an ad-hoc manner
> with no formal API definition and are explicitly Not-An-Interface and as a
> result no consideration has been given to cross-version stability. A
> compatibility shim approach is neither sustainable nor maintainable even on
> a single platform, and will severely compromise efforts to get Hadoop native
> components working on other platforms.
I agree.
>
> The approach suggested in HADOOP-11127 seems a much better way forward, in
> particular #2 (versioned libhadoop). As pointed out in the JIRA, #1 (freeze
> libhadoop forever) is an obvious non-starter, and #3 (distribute libhadoop
> inside the JAR) is also a non-starter as it will not work cross-platform.
>
> I'm happy to work on HADOOP-10027 and make that a prerequisite for fixing
> the Solaris DomainSocket issues discussed in this thread. I believe it's not
> practical to provide a fix for DomainSocket on Solaris with a 'No JNI
> signature changes' restriction.
I think you could come up with a select/poll solution while using the
old function signatures. A 4-byte int is more than enough information
to pass in, given that you can use it as an index into a table in the
C code. There are also a lot of other solutions to this problem, like
I pointed out earlier. For example, you dismissed the timer wheel
suggestion because of a detail of a unit test, but we could easily
change the test.
Anyway, changing the function signatures in the way you described is
certainly reasonable and I wouldn't object to it. It is probably the
most natural solution.
>
> Does that sound acceptable? If so I can draft up a proposal for native
> library version and platform naming, library search locations etc.
Yes, I think it would be good to make some progress on HADOOP-11127.
We have been putting off the issue for too long.
best,
Colin
>
>
> Thanks,
>
> --
> Alan Burlison
> --
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 06/10/2015 10:52, Steve Loughran wrote:
> HADOOP-11127, "Improve versioning and compatibility support in native
> library for downstream hadoop-common users." says "we need to do
> better here", which is probably some way of packaging native libs.
From that JIRA:
> Colin Patrick McCabe added a comment - 18/Apr/15 00:48
>
> I was thinking we:
> 1. Add the Hadoop release version to libhadoop.so. It's very, very
> simple and solves a lot of problems here.
> 2. Remove libhadoop.so and libhdfs.so from the release tarball, since
> they are CPU and OS-specific and the tarballs are not
> 3. Schedule some follow-on work to include the native libraries
> inside jars, as Chris suggested. This will take longer but ultimately
> be the best solution.
And:
> I just spotted one: HADOOP-10027. A field was removed from the Java
> layer, which still could get referenced by an older version of the native
> layer. A backwards-compatible version of that patch would preserve the
> old fields in the Java layer.
I've been thinking about this and I really don't think the strategy of
trying to shim old methods and fields back into Hadoop is the correct
one. The current Java-JNI interactions have been developed in an ad-hoc
manner with no formal API definition and are explicitly Not-An-Interface
and as a result no consideration has been given to cross-version
stability. A compatibility shim approach is neither sustainable nor
maintainable even on a single platform, and will severely compromise
efforts to get Hadoop native components working on other platforms.
The approach suggested in HADOOP-11127 seems a much better way forward,
in particular #2 (versioned libhadoop). As pointed out in the JIRA, #1
(freeze libhadoop forever) is an obvious non-starter, and #3 (distribute
libhadoop inside the JAR) is also a non-starter as it will not work
cross-platform.
I'm happy to work on HADOOP-10027 and make that a prerequisite for
fixing the Solaris DomainSocket issues discussed in this thread. I
believe it's not practical to provide a fix for DomainSocket on Solaris
with a 'No JNI signature changes' restriction.
Does that sound acceptable? If so I can draft up a proposal for native
library version and platform naming, library search locations etc.
Thanks,
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 06/10/2015 17:03, Chris Nauroth wrote:
> Alan, would you please list the specific patches/JIRA issues that broke
> compatibility? I have not been reviewing the native code lately, so it
> would help me catch up quickly if you already know which specific patches
> have introduced problems. If those patches currently reside only on trunk
> and branch-2, then they have not yet shipped in an Apache release. We'd
> still have an opportunity to fix them and avoid "dropping the match"
> before shipping 2.8.0.
https://issues.apache.org/jira/browse/HADOOP-11985 was the one I was
thinking about as it changed fields from final to static. I haven't
figured out what impact that has on the classes & shared object. Plus
https://issues.apache.org/jira/browse/HADOOP-12184 which removed some
fields.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Steve Loughran <st...@hortonworks.com>.
>>
>>
>> On 10/6/15, 8:25 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>>>
>>>>> In any case the constraint you are requesting would flat-out
>>>>> preclude this change, and would also mean that most of the other
>>>>> JNI changes that have been committed recently would have to be
>>>>> ripped out as well . In summary, the bridge is already burned.
>>>>
>>>> We've covered the bridge in petrol but not quite dropped a match on
>>>> it.
>>>
>>> No, I'm reasonably certain you've already dropped the match, and if you
>>> haven't it's just good fortune.
>>>
>>> --
>>> Alan Burlison
>>> --
>>>
>>
>>
>
Ok, we just hadn't noticed the bridge was on fire...
Re: DomainSocket issues on Solaris
Posted by Chris Nauroth <cn...@hortonworks.com>.
I just spotted one: HADOOP-10027. A field was removed from the Java
layer, which still could get referenced by an older version of the native
layer. A backwards-compatible version of that patch would preserve the
old fields in the Java layer.
Full disclosure: I was the one who committed that patch, so this was a
miss by me during the code review.
--Chris Nauroth
On 10/6/15, 9:03 AM, "Chris Nauroth" <cn...@hortonworks.com> wrote:
>Alan, would you please list the specific patches/JIRA issues that broke
>compatibility? I have not been reviewing the native code lately, so it
>would help me catch up quickly if you already know which specific patches
>have introduced problems. If those patches currently reside only on trunk
>and branch-2, then they have not yet shipped in an Apache release. We'd
>still have an opportunity to fix them and avoid "dropping the match"
>before shipping 2.8.0.
>
>Yes, we are aware that binary compatibility goes beyond the function
>signatures and into data layout and semantics.
>
>--Chris Nauroth
>
>
>
>
>On 10/6/15, 8:25 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>
>>On 06/10/2015 10:52, Steve Loughran wrote:
>>
>>>> That's not achievable as the method signatures need to change. Even
>>>> though they are private they need to change from static to normal
>>>> methods and the signatures need to change as well, as I said.
>>>
>>> We've done it before, simply by retaining the older method entry
>>> points. Moving from static to instance-specific is a bigger change.
>>> If the old entry points are there and retained, even if all uses have
>>> been ripped out of the hadoop code, then the new methods will get
>>> used. It's just that old stuff will still link.
>>
>>As I explained in my last email, converting the old static JNI functions
>>to be wrappers around new instance JNI functions requires a jobject
>>reference to be passed into the new function that the old one wraps
>>around. The static methods can't magic one up. An instance pointer *is*
>>available: the current code flow is Java object method -> static JNI
>>function, so if we could change the JNI from static->instance then we'd
>>have what we needed. But if you are considering the JNI layer to be a
>>public interface (which I think is a big mistake, no matter how
>>convenient it might be), then you are simply screwed, both here and in
>>other places. As I've said, I have a suspicion that changes we've
>>already made have broken that compatibility anyway.
>>
>>>> JNI code is intimately intertwined with the Java code it runs
>>>> with. Running mismatching Java & JNI versions is going to be a
>>>> recipe for eventual disaster as the JVM explicitly does *not* do
>>>> any error checking between Java and JNI.
>>>
>>> You mean jni code built for java7 isn't guaranteed to work on Java 8?
>>> If so, that's not something we knew of —and something to worry
>>> about.
>>
>>Actually I think that particular scenario is going to be OK. I wasn't
>>clear - sorry - what I was musing about was the fact that the Hadoop JNI
>>IO code delves into the innards of the platform Java classes and pulls
>>out bits of private data. That's explicitly not-an-interface and could
>>break at any time, although the likelihood may be low, the JVM developers
>>could change it and you'd just be SOL. The same goes for all the other
>>private Java interfaces that Hadoop consumes - all the ones you get
>>warnings about when you build it. For example, there are already plans to
>>make significant changes to sun.misc.Unsafe. That will
>>affect Hadoop.
>>
>>>> At some point some innocuous change will be made that will just
>>>> cause undefined behaviour.
>>>>
>>>> I don't actually know how you'd get a JAR/JNI mismatch as they are
>>>> built and packaged together, so I'm struggling to understand what
>>>> the potential issue is here.
>>>
>>> it arises whenever you try to deploy to YARN any application
>>> containing directly or indirectly (e.g. inside the spark-assembly
>>> JAR) the Hadoop java classes of a previous Java version. libhadoop is
>>> on the PATH of the far end, your app uploads their hadoop JARs, and
>>> the moment something tries to use the JNI-backed method you get to
>>> see a stack trace.
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-11064
>>>
>>> if you look at the patch there, that's the kind of thing I'd like to
>>> see to address your solaris issues.
>>
>>Hmm, yes. That appears to be a short-term hack-around to keep things
>>running, not a fix. At very best, it's extremely fragile.
>>
>> From the bug:
>>
>>"We don't have any way of enforcing C API stability. Jenkins doesn't
>>check for it, most Java programmers don't know how to achieve it."
>>
>>In which case I think reading this will be helpful:
>>http://docs.oracle.com/cd/E19253-01/817-1984/chapter5-84101/index.html
>>
>>The assumption seems to be that as long as libhadoop.so keeps the same
>>list of functions with the same arguments then it will be
>>backwards-compatible. Unfortunately that's just flat out wrong. Binary
>>compatibility requires more than that. It also requires that there are
>>no changes to any data structures, and that the semantics of all the
>>functions remain completely unchanged. I'd put money on that not being
>>the case already. The errors you saw HADOOP-11064 are the easy ones
>>because you got a run-time linker error. The others will cause
>>mysterious behaviour, memory corruption and general WTFness.
>>
>>>> In any case the constraint you are requesting would flat-out
>>>> preclude this change, and would also mean that most of the other
>>>> JNI changes that have been committed recently would have to be
>>>> ripped out as well . In summary, the bridge is already burned.
>>>
>>> We've covered the bridge in petrol but not quite dropped a match on
>>> it.
>>
>>No, I'm reasonably certain you've already dropped the match, and if you
>>haven't it's just good fortune.
>>
>>--
>>Alan Burlison
>>--
>>
>
>
Re: DomainSocket issues on Solaris
Posted by Chris Nauroth <cn...@hortonworks.com>.
Alan, would you please list the specific patches/JIRA issues that broke
compatibility? I have not been reviewing the native code lately, so it
would help me catch up quickly if you already know which specific patches
have introduced problems. If those patches currently reside only on trunk
and branch-2, then they have not yet shipped in an Apache release. We'd
still have an opportunity to fix them and avoid "dropping the match"
before shipping 2.8.0.
Yes, we are aware that binary compatibility goes beyond the function
signatures and into data layout and semantics.
--Chris Nauroth
On 10/6/15, 8:25 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>On 06/10/2015 10:52, Steve Loughran wrote:
>
>>> That's not achievable as the method signatures need to change. Even
>>> though they are private they need to change from static to normal
>>> methods and the signatures need to change as well, as I said.
>>
>> We've done it before, simply by retaining the older method entry
>> points. Moving from static to instance-specific is a bigger change.
>> If the old entry points are there and retained, even if all uses have
>> been ripped out of the hadoop code, then the new methods will get
>> used. It's just that old stuff will still link.
>
>As I explained in my last email, converting the old static JNI functions
>to be wrappers around new instance JNI functions requires a jobject
>reference to be passed into the new function that the old one wraps
>around. The static methods can't magic one up. An instance pointer *is*
>available: the current code flow is Java object method -> static JNI
>function, so if we could change the JNI from static->instance then we'd
>have what we needed. But if you are considering the JNI layer to be a
>public interface (which I think is a big mistake, no matter how
>convenient it might be), then you are simply screwed, both here and in
>other places. As I've said, I have a suspicion that changes we've
>already made have broken that compatibility anyway.
>
>>> JNI code is intimately intertwined with the Java code it runs
>>> with. Running mismatching Java & JNI versions is going to be a
>>> recipe for eventual disaster as the JVM explicitly does *not* do
>>> any error checking between Java and JNI.
>>
>> You mean jni code built for java7 isn't guaranteed to work on Java 8?
>> If so, that's not something we knew of —and something to worry
>> about.
>
>Actually I think that particular scenario is going to be OK. I wasn't
>clear - sorry - what I was musing about was the fact that the Hadoop JNI
>IO code delves into the innards of the platform Java classes and pulls
>out bits of private data. That's explicitly not-an-interface and could
>break at any time, although the likelihood may be low, the JVM developers
>could change it and you'd just be SOL. The same goes for all the other
>private Java interfaces that Hadoop consumes - all the ones you get
>warnings about when you build it. For example, there are already plans to
>make significant changes to sun.misc.Unsafe. That will
>affect Hadoop.
>
>>> At some point some innocuous change will be made that will just
>>> cause undefined behaviour.
>>>
>>> I don't actually know how you'd get a JAR/JNI mismatch as they are
>>> built and packaged together, so I'm struggling to understand what
>>> the potential issue is here.
>>
>> it arises whenever you try to deploy to YARN any application
>> containing directly or indirectly (e.g. inside the spark-assembly
>> JAR) the Hadoop java classes of a previous Java version. libhadoop is
>> on the PATH of the far end, your app uploads their hadoop JARs, and
>> the moment something tries to use the JNI-backed method you get to
>> see a stack trace.
>>
>> https://issues.apache.org/jira/browse/HADOOP-11064
>>
>> if you look at the patch there, that's the kind of thing I'd like to
>> see to address your solaris issues.
>
>Hmm, yes. That appears to be a short-term hack-around to keep things
>running, not a fix. At very best, it's extremely fragile.
>
> From the bug:
>
>"We don't have any way of enforcing C API stability. Jenkins doesn't
>check for it, most Java programmers don't know how to achieve it."
>
>In which case I think reading this will be helpful:
>http://docs.oracle.com/cd/E19253-01/817-1984/chapter5-84101/index.html
>
>The assumption seems to be that as long as libhadoop.so keeps the same
>list of functions with the same arguments then it will be
>backwards-compatible. Unfortunately that's just flat out wrong. Binary
>compatibility requires more than that. It also requires that there are
>no changes to any data structures, and that the semantics of all the
>functions remain completely unchanged. I'd put money on that not being
>the case already. The errors you saw HADOOP-11064 are the easy ones
>because you got a run-time linker error. The others will cause
>mysterious behaviour, memory corruption and general WTFness.
>
>>> In any case the constraint you are requesting would flat-out
>>> preclude this change, and would also mean that most of the other
>>> JNI changes that have been committed recently would have to be
>>> ripped out as well . In summary, the bridge is already burned.
>>
>> We've covered the bridge in petrol but not quite dropped a match on
>> it.
>
>No, I'm reasonably certain you've already dropped the match, and if you
>haven't it's just good fortune.
>
>--
>Alan Burlison
>--
>
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 06/10/2015 10:52, Steve Loughran wrote:
>> That's not achievable as the method signatures need to change. Even
>> though they are private they need to change from static to normal
>> methods and the signatures need to change as well, as I said.
>
> We've done it before, simply by retaining the older method entry
> points. Moving from static to instance-specific is a bigger change.
> If the old entry points are there and retained, even if all uses have
> been ripped out of the hadoop code, then the new methods will get
> used. It's just that old stuff will still link.
As I explained in my last email, converting the old static JNI functions
to be wrappers around new instance JNI functions requires a jobject
reference to be passed into the new function that the old one wraps
around. The static methods can't magic one up. An instance pointer *is*
available: the current code flow is Java object method -> static JNI
function, so if we could change the JNI from static->instance then we'd
have what we needed. But if you are considering the JNI layer to be a
public interface (which I think is a big mistake, no matter how
convenient it might be), then you are simply screwed, both here and in
other places. As I've said, I have a suspicion that changes we've
already made have broken that compatibility anyway.
>> JNI code is intimately intertwined with the Java code it runs
>> with. Running mismatching Java & JNI versions is going to be a
>> recipe for eventual disaster as the JVM explicitly does *not* do
>> any error checking between Java and JNI.
>
> You mean jni code built for java7 isn't guaranteed to work on Java 8?
> If so, that's not something we knew of —and something to worry
> about.
Actually I think that particular scenario is going to be OK. I wasn't
clear - sorry - what I was musing about was the fact that the Hadoop JNI
IO code delves into the innards of the platform Java classes and pulls
out bits of private data. That's explicitly not-an-interface and could
break at any time, although the likelihood may be low, the JVM developers
could change it and you'd just be SOL. The same goes for all the other
private Java interfaces that Hadoop consumes - all the ones you get
warnings about when you build it. For example, there are already plans to
make significant changes to sun.misc.Unsafe. That will
affect Hadoop.
>> At some point some innocuous change will be made that will just
>> cause undefined behaviour.
>>
>> I don't actually know how you'd get a JAR/JNI mismatch as they are
>> built and packaged together, so I'm struggling to understand what
>> the potential issue is here.
>
> it arises whenever you try to deploy to YARN any application
> containing directly or indirectly (e.g. inside the spark-assembly
> JAR) the Hadoop java classes of a previous Java version. libhadoop is
> on the PATH of the far end, your app uploads their hadoop JARs, and
> the moment something tries to use the JNI-backed method you get to
> see a stack trace.
>
> https://issues.apache.org/jira/browse/HADOOP-11064
>
> if you look at the patch there, that's the kind of thing I'd like to
> see to address your solaris issues.
Hmm, yes. That appears to be a short-term hack-around to keep things
running, not a fix. At very best, it's extremely fragile.
From the bug:
"We don't have any way of enforcing C API stability. Jenkins doesn't
check for it, most Java programmers don't know how to achieve it."
In which case I think reading this will be helpful:
http://docs.oracle.com/cd/E19253-01/817-1984/chapter5-84101/index.html
The assumption seems to be that as long as libhadoop.so keeps the same
list of functions with the same arguments then it will be
backwards-compatible. Unfortunately that's just flat out wrong. Binary
compatibility requires more than that. It also requires that there are
no changes to any data structures, and that the semantics of all the
functions remain completely unchanged. I'd put money on that not being
the case already. The errors you saw HADOOP-11064 are the easy ones
because you got a run-time linker error. The others will cause
mysterious behaviour, memory corruption and general WTFness.
>> In any case the constraint you are requesting would flat-out
>> preclude this change, and would also mean that most of the other
>> JNI changes that have been committed recently would have to be
>> ripped out as well . In summary, the bridge is already burned.
>
> We've covered the bridge in petrol but not quite dropped a match on
> it.
No, I'm reasonably certain you've already dropped the match, and if you
haven't it's just good fortune.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Steve Loughran <st...@hortonworks.com>.
> On 5 Oct 2015, at 15:56, Alan Burlison <Al...@oracle.com> wrote:
>
> On 05/10/2015 15:14, Steve Loughran wrote:
>
>> I don't think anyone would object for the changes, except for one big
>> caveat: a lot of us would like that binary file to be backwards
>> compatible; a Hadoop 2.6 JAR should be able to link to the 2.8+
>> libhadoop. So whatever gets changed, the old methods are still going
>> to hang around
>
> That's not achievable as the method signatures need to change. Even though they are private they need to change from static to normal methods and the signatures need to change as well, as I said.
We've done it before, simply by retaining the older method entry points. Moving from static to instance-specific is a bigger change. If the old entry points are there and retained, even if all uses have been ripped out of the hadoop code, then the new methods will get used. It's just that old stuff will still link.
>
> JNI code is intimately intertwined with the Java code it runs with. Running mismatching Java & JNI versions is going to be a recipe for eventual disaster as the JVM explicitly does *not* do any error checking between Java and JNI.
You mean jni code built for java7 isn't guaranteed to work on Java 8? If so, that's not something we knew of —and something to worry about.
> At some point some innocuous change will be made that will just cause undefined behaviour.
>
> I don't actually know how you'd get a JAR/JNI mismatch as they are built and packaged together, so I'm struggling to understand what the potential issue is here.
it arises whenever you try to deploy to YARN any application containing directly or indirectly (e.g. inside the spark-assembly JAR) the Hadoop java classes of a previous Java version. libhadoop is on the PATH of the far end, your app uploads their hadoop JARs, and the moment something tries to use the JNI-backed method you get to see a stack trace.
https://issues.apache.org/jira/browse/HADOOP-11064
if you look at the patch there, that's the kind of thing I'd like to see to address your solaris issues.
>
> In any case the constraint you are requesting would flat-out preclude this change, and would also mean that most of the other JNI changes that have been committed recently would have to be ripped out as well . In summary, the bridge is already burned.
>
We've covered the bridge in petrol but not quite dropped a match on it.
HADOOP-11127, "Improve versioning and compatibility support in native library for downstream hadoop-common users." says "we need to do better here", which is probably some way of packaging native libs.
Now, if you look at our compatibility statement, we don't say anything about native binary linking:
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html
We have managed to avoid addressing this issue to date: the HADOOP-11064 problem was caught before 2.6 shipped, and the patch put in without setting an immutable guarantee of compatibility going forward. We just don't want to light that bridge when a lot of users are on the other side of it.
-Steve
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 06/10/2015 11:01, Steve Loughran wrote:
>> I really don't want to do that as it relegates Solaris to only ever
>> being a second-class citizen.
>
> I know that Solaris matters to you 100%, and we've tried to be as
> supportive as we can, even though it's not viewed as important to
> anyone else. We don't want to make it 2nd class, just want to get it
> to be 1st class in a way which doesn't create lots of compatibility
> problems.
Yes you have been supportive, I recognise that and I'm grateful for it
:-) Although I'm the main Solaris person that's visible, I'm not the
only one who is interested. And I fully get the backwards compatibility
thing; it's one of the main features of Solaris. However, keeping
backwards binary compatibility is something you really have to decide up
front and design for; it's very difficult to add it as a constraint
after the fact, as this scenario illustrates. And without internal or
external library versioning support, it's even harder still.
> Is the per-socket timeout assumption used anywhere outside the
> tests?
I've no real idea yet as I haven't yet got to the point where I have a
'Full Fat JNI' version of Hadoop on Solaris. I do know that around 50%
of the ~200 test failures I'm seeing are most likely related to timeout
handling, which is why I'm concentrating on it.
> so we move from
>
> function(fileHandle)
>
> to function(Object), where object->fileHandle and object->timeout are both there?
To be precise, the signature change I have at the moment is (for example)
JNIEXPORT jint JNICALL
Java_org_apache_hadoop_net_unix_DomainSocket_accept0(
JNIEnv *env, jclass clazz, jint fd)
to
JNIEXPORT jint JNICALL
Java_org_apache_hadoop_net_unix_DomainSocket_accept0(
JNIEnv *env, jobject obj)
filehandle, readTimeout and writeTimeout are then accessed as members of
the jobject.
> what about
>
> function(fileHandle, timeout)
>
> where we retain
>
> function(fileHandle) { return function(fileHandle, defaultTimeout)}?
>
> And then never invoke it in our existing code, which now calls the new operation?
> or if there's a call
>
> setTimeout(fileHandle, timeout)
>
> which for linux sets the socket timeout —and in solaris updates some
> map handle->timeout used in the select() call.
Yes, I'd thought of that. The problem is the 'some map' bit. Maintaining
that map would be clunky - file descriptor IDs are not going to be
sequential and are reused so we'd have to store them in some sort of
shadow data structure and track each and every close, and that's fiddly.
And the 'default timeout' option is I believe a non-starter, the default
timeout is 2 minutes and many of the tests set it to a much shorter
interval and expect it to time out at the specified time.
The problem is that if we store the timeout along the filehandle then we
need access to an object pointer to retrieve it during the socket call.
As the existing functions are static ones an object pointer isn't available.
I've looked long and hard at this, I have not come up with a mechanism
that is both backwards binary compatible and not totally vile.
>> The other option is to effectively write a complete Solaris-only
>> replacement for DomainSocket, whether switching between that and the
>> current one is done at compile or run-time isn't really the point.
>> There's a fairly even split between the Java & JNI components of
>> DomainSocket, so whichever way it's done there will be significant
>> duplication of the overall logic and most likely code duplication.
>> That means that bug fixes in one place have to be exactly mirrored in
>> another, and that's unlikely to be sustainable.
>
> It's not going to be maintained, or more precisely: it'll be broken
> on a regular basis and you are the one left to handle it.
Exactly, which is why it is a non-starter. Whatever I do to fix this
needs to be as minimal as possible and needs to disappear on platforms
which don't need it.
>> Unfortunately I can't predict when that might happen by, though. In
>> my prototype it probes for working timeouts at configure time, so
>> when they do become available they'll be used automatically.
>
> I agree that there is no formal libhadoop.so compatibility policy and
> that is frustrating. This has been an issue for those who want to run
> jars compiled against multiple different versions of hadoop through
> the same YARN instance. We've discussed it in the past, but never
> really come up with a great solution. The best approach really would
> be to bundle libhadoop.so inside the hadoop jar files, so that it
> could be integral to the Hadoop version itself. However, nobody has
> done the work to make that happen. The second-best approach would be
> to include the Hadoop version in the libhadoop name itself (so we'd
> have libhadoop28.so for hadoop 2.8, and so forth.) Anyway, I think we
> can solve this particular issue without going down that rathole...
Unfortunately I don't think we can, not without further complicating the
existing complicated code with a lot of scaffolding.
I don't understand how YARN & multiple Hadoop versions interact, but if
they are all in the same JVM instance then no amount of fiddling with
shared objects will help as you can't have multiple SOs providing the
same APIs within the same process - or at least not without a lot of
complicated, fragile and utterly platform-specific configuration and code.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Steve Loughran <st...@hortonworks.com>.
On 6 Oct 2015, at 00:34, Alan Burlison <Al...@oracle.com> wrote:
On 05/10/15 18:30, Colin P. McCabe wrote:
1. Don't get DomainSocket working on Solaris. Rely on the legacy
short-circuit read instead. It has poorer security guarantees, but
doesn't require domain sockets. You can add a line of code to the
failing junit tests to skip them on Solaris.
I really don't want to do that as it relegates Solaris to only ever being a second-class citizen.
I know that Solaris matters to you 100%, and we've tried to be as supportive as we can, even though it's not viewed as important to anyone else. We don't want to make it 2nd class, just want to get it to be 1st class in a way which doesn't create lots of compatibility problems.
2. Use a separate "timer wheel" thread which implements coarse-grained
timeouts by calling shutdown() on domain sockets that have been active
for too long. This thread could be global (one per JVM).
From what I can tell that won't stop all the test failures as they are written with the assumption that per-socket timeouts are available and that they time out exactly when expected.
Is the per-socket timeout assumption used anywhere outside the tests?
3. Implement the poll/select loop you discussed earlier. As Steve
commented, it would be easier to do this by adding new functions,
rather than by changing existing ones. I don't think "ifdef skid
marks" are necessary since poll and select are supported on Linux and
so forth as well as Solaris. You would just need some code in
DomainSocket.java to select the appropriate implementation at runtime
based on the OS.
I could switch the implementation over to use poll everywhere but I haven't done that - Linux still uses socket timeouts. The issue is that in order to make poll() work I need to maintain the read/write timeouts alongside the filehandle - I can't store the timeout 'inside' the filehandle using setsockopt(). That means that the filehandle and the timeouts have to be stored together somewhere. The logical place to put the timeouts is in the same DomainSocket instances that holds the filehandle. If the DomainSocket JNI methods were all instance methods then there wouldn't be a problem, but they aren't, they are static methods where the integer filehandle is passed in as a parameter. And it wouldn't work if I change the native method parameter lists to include the timeouts as they need to be read/write. The only non-vile way I can come up with of doing this is to convert the JNI methods from static into instance methods. Even if that's the only change I make and I still pass in the filehandle as a parameter, the signatures will have changed as the 2nd parameter would now be an object reference and not a class reference.
so we move from
function(fileHandle)
to function(Object), where object->fileHandle and object->timeout are both there?
what about
function(fileHandle, timeout)
where we retain
function(fileHandle) { return function(fileHandle, defaultTimeout)}?
And then never invoke it in our existing code, which now calls the new operation?
or if there's a call
setTimeout(fileHandle, timeout)
which for linux sets the socket timeout —and in solaris updates some map handle->timeout used in the select() call.
The other option is to effectively write a complete Solaris-only replacement for DomainSocket, whether switching between that and the current one is done at compile or run-time isn't really the point. There's a fairly even split between the Java & JNI components of DomainSocket, so whichever way it's done there will be significant duplication of the overall logic and most likely code duplication. That means that bug fixes in one place have to be exactly mirrored in another, and that's unlikely to be sustainable.
It's not going to be maintained, or more precisely: it'll be broken on a regular basis and you are the one left to handle it.
My goal has been to keep the current logic as unchanged as possible. My prototype does that by literally prefixing each libc socket operation with a poll() call to check the filehandle is ready. The rest of the logic in DomainSocket is completely unchanged. That means that the behaviour between Linux and Solaris should be as identical as is possible.
Since you commented that Solaris is implementing timeout support in
the future, approaches #1 or #2 could be placeholders until that's
finished.
Unfortunately I can't predict when that might happen by, though. In my prototype it probes for working timeouts at configure time, so when they do become available they'll be used automatically.
I agree that there is no formal libhadoop.so compatibility policy and
that is frustrating. This has been an issue for those who want to run
jars compiled against multiple different versions of hadoop through
the same YARN instance. We've discussed it in the past, but never
really come up with a great solution. The best approach really would
be to bundle libhadoop.so inside the hadoop jar files, so that it
could be integral to the Hadoop version itself. However, nobody has
done the work to make that happen. The second-best approach would be
to include the Hadoop version in the libhadoop name itself (so we'd
have libhadoop28.so for hadoop 2.8, and so forth.) Anyway, I think we
can solve this particular issue without going down that rathole...
As I said, I believe that ship has long since sailed. Changes that have already been let in have I believe broken the backwards binary compatibility of the Java/JNI interface. Broken is broken, arguing that this proposal shouldn't be allowed in because it simply adds more brokenness to the existing brokenness is really missing the point. As far as I can tell, there already is no backwards compatibility.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 05/10/15 18:30, Colin P. McCabe wrote:
> 1. Don't get DomainSocket working on Solaris. Rely on the legacy
> short-circuit read instead. It has poorer security guarantees, but
> doesn't require domain sockets. You can add a line of code to the
> failing junit tests to skip them on Solaris.
I really don't want to do that as it relegates Solaris to only ever
being a second-class citizen.
> 2. Use a separate "timer wheel" thread which implements coarse-grained
> timeouts by calling shutdown() on domain sockets that have been active
> for too long. This thread could be global (one per JVM).
From what I can tell that won't stop all the test failures as they are
written with the assumption that per-socket timeouts are available and
that they time out exactly when expected.
> 3. Implement the poll/select loop you discussed earlier. As Steve
> commented, it would be easier to do this by adding new functions,
> rather than by changing existing ones. I don't think "ifdef skid
> marks" are necessary since poll and select are supported on Linux and
> so forth as well as Solaris. You would just need some code in
> DomainSocket.java to select the appropriate implementation at runtime
> based on the OS.
I could switch the implementation over to use poll everywhere but I
haven't done that - Linux still uses socket timeouts. The issue is that
in order to make poll() work I need to maintain the read/write timeouts
alongside the filehandle - I can't store the timeout 'inside' the
filehandle using setsockopt(). That means that the filehandle and the
timeouts have to be stored together somewhere. The logical place to put
the timeouts is in the same DomainSocket instances that holds the
filehandle. If the DomainSocket JNI methods were all instance methods
then there wouldn't be a problem, but they aren't, they are static
methods where the integer filehandle is passed in as a parameter. And it
wouldn't work if I change the native method parameter lists to include
the timeouts as they need to be read/write. The only non-vile way I can
come up with of doing this is to convert the JNI methods from static
into instance methods. Even if that's the only change I make and I still
pass in the filehandle as a parameter, the signatures will have changed
as the 2nd parameter would now be an object reference and not a class
reference.
The other option is to effectively write a complete Solaris-only
replacement for DomainSocket, whether switching between that and the
current one is done at compile or run-time isn't really the point.
There's a fairly even split between the Java & JNI components of
DomainSocket, so whichever way it's done there will be significant
duplication of the overall logic and most likely code duplication. That
means that bug fixes in one place have to be exactly mirrored in
another, and that's unlikely to be sustainable.
My goal has been to keep the current logic as unchanged as possible. My
prototype does that by literally prefixing each libc socket operation
with a poll() call to check the filehandle is ready. The rest of the
logic in DomainSocket is completely unchanged. That means that the
behaviour between Linux and Solaris should be as identical as is possible.
> Since you commented that Solaris is implementing timeout support in
> the future, approaches #1 or #2 could be placeholders until that's
> finished.
Unfortunately I can't predict when that might happen by, though. In my
prototype it probes for working timeouts at configure time, so when they
do become available they'll be used automatically.
> I agree that there is no formal libhadoop.so compatibility policy and
> that is frustrating. This has been an issue for those who want to run
> jars compiled against multiple different versions of hadoop through
> the same YARN instance. We've discussed it in the past, but never
> really come up with a great solution. The best approach really would
> be to bundle libhadoop.so inside the hadoop jar files, so that it
> could be integral to the Hadoop version itself. However, nobody has
> done the work to make that happen. The second-best approach would be
> to include the Hadoop version in the libhadoop name itself (so we'd
> have libhadoop28.so for hadoop 2.8, and so forth.) Anyway, I think we
> can solve this particular issue without going down that rathole...
As I said, I believe that ship has long since sailed. Changes that have
already been let in have I believe broken the backwards binary
compatibility of the Java/JNI interface. Broken is broken, arguing that
this proposal shouldn't be allowed in because it simply adds more
brokenness to the existing brokenness is really missing the point. As
far as I can tell, there already is no backwards compatibility.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by "Colin P. McCabe" <cm...@apache.org>.
Hi Alan,
As Chris commented earlier, the main use of DomainSocket is to
transfer file descriptors from the DataNode to the DFSClient. As you
know, this is something that can only be done through domain sockets,
not through inet sockets. We do support passing data over domain
sockets, but in practice we rarely turn it on since we haven't seen a
performance advantage.
As I see it, you have a few different options here for getting this
working on Solaris.
1. Don't get DomainSocket working on Solaris. Rely on the legacy
short-circuit read instead. It has poorer security guarantees, but
doesn't require domain sockets. You can add a line of code to the
failing junit tests to skip them on Solaris.
2. Use a separate "timer wheel" thread which implements coarse-grained
timeouts by calling shutdown() on domain sockets that have been active
for too long. This thread could be global (one per JVM).
3. Implement the poll/select loop you discussed earlier. As Steve
commented, it would be easier to do this by adding new functions,
rather than by changing existing ones. I don't think "ifdef skid
marks" are necessary since poll and select are supported on Linux and
so forth as well as Solaris. You would just need some code in
DomainSocket.java to select the appropriate implementation at runtime
based on the OS.
Since you commented that Solaris is implementing timeout support in
the future, approaches #1 or #2 could be placeholders until that's
finished.
I agree that there is no formal libhadoop.so compatibility policy and
that is frustrating. This has been an issue for those who want to run
jars compiled against multiple different versions of hadoop through
the same YARN instance. We've discussed it in the past, but never
really come up with a great solution. The best approach really would
be to bundle libhadoop.so inside the hadoop jar files, so that it
could be integral to the Hadoop version itself. However, nobody has
done the work to make that happen. The second-best approach would be
to include the Hadoop version in the libhadoop name itself (so we'd
have libhadoop28.so for hadoop 2.8, and so forth.) Anyway, I think we
can solve this particular issue without going down that rathole...
best,
Colin
On Mon, Oct 5, 2015 at 7:56 AM, Alan Burlison <Al...@oracle.com> wrote:
> On 05/10/2015 15:14, Steve Loughran wrote:
>
>> I don't think anyone would object for the changes, except for one big
>> caveat: a lot of us would like that binary file to be backwards
>> compatible; a Hadoop 2.6 JAR should be able to link to the 2.8+
>> libhadoop. So whatever gets changed, the old methods are still going
>> to hang around
>
>
> That's not achievable as the method signatures need to change. Even though
> they are private they need to change from static to normal methods and the
> signatures need to change as well, as I said.
>
> JNI code is intimately intertwined with the Java code it runs with. Running
> mismatching Java & JNI versions is going to be a recipe for eventual
> disaster as the JVM explicitly does *not* do any error checking between Java
> and JNI. At some point some innocuous change will be made that will just
> cause undefined behaviour.
>
> I don't actually know how you'd get a JAR/JNI mismatch as they are built and
> packaged together, so I'm struggling to understand what the potential issue
> is here.
>
> In any case the constraint you are requesting would flat-out preclude this
> change, and would also mean that most of the other JNI changes that have
> been committed recently would have to be ripped out as well. In summary,
> the bridge is already burned.
>
> --
> Alan Burlison
> --
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 05/10/2015 15:14, Steve Loughran wrote:
> I don't think anyone would object for the changes, except for one big
> caveat: a lot of us would like that binary file to be backwards
> compatible; a Hadoop 2.6 JAR should be able to link to the 2.8+
> libhadoop. So whatever gets changed, the old methods are still going
> to hang around
That's not achievable as the method signatures need to change. Even
though they are private they need to change from static to normal
methods and the signatures need to change as well, as I said.
JNI code is intimately intertwined with the Java code it runs with.
Running mismatching Java & JNI versions is going to be a recipe for
eventual disaster as the JVM explicitly does *not* do any error checking
between Java and JNI. At some point some innocuous change will be made
that will just cause undefined behaviour.
I don't actually know how you'd get a JAR/JNI mismatch as they are built
and packaged together, so I'm struggling to understand what the
potential issue is here.
In any case the constraint you are requesting would flat-out preclude
this change, and would also mean that most of the other JNI changes that
have been committed recently would have to be ripped out as well. In
summary, the bridge is already burned.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Steve Loughran <st...@hortonworks.com>.
I don't think anyone would object for the changes, except for one big caveat: a lot of us would like that binary file to be backwards compatible; a Hadoop 2.6 JAR should be able to link to the 2.8+ libhadoop. So whatever gets changed, the old methods are still going to hang around
> On 2 Oct 2015, at 17:46, Alan Burlison <Al...@oracle.com> wrote:
>
> On 30/09/2015 09:14, Alan Burlison wrote:
>
>> The basic idea is to add two new fields to DomainSocket.c to hold the
>> read/write timeouts. On platforms that support SO_SNDTIMEO and
>> SO_RCVTIMEO these would be unused as setsockopt() would be used to set
>> the socket timeouts. On platforms such as Solaris the JNI code would use
>> the values to implement the timeouts appropriately.
>
> Unfortunately it's not as simple as I'd hoped. For some reason I don't really understand, nearly all the JNI methods are declared as static and therefore don't get a "this" pointer and as a consequence all the class data members that are needed by the JNI code have to be passed in as parameters. That also means it's not possible to store the timeouts in the DomainSocket fields from within the JNI code. Most of the JNI methods should be instance methods rather than static ones, but making that change would require some significant surgery to DomainSocket.
>
> --
> Alan Burlison
> --
>
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 30/09/2015 09:14, Alan Burlison wrote:
> The basic idea is to add two new fields to DomainSocket.c to hold the
> read/write timeouts. On platforms that support SO_SNDTIMEO and
> SO_RCVTIMEO these would be unused as setsockopt() would be used to set
> the socket timeouts. On platforms such as Solaris the JNI code would use
> the values to implement the timeouts appropriately.
Unfortunately it's not as simple as I'd hoped. For some reason I don't
really understand, nearly all the JNI methods are declared as static and
therefore don't get a "this" pointer and as a consequence all the class
data members that are needed by the JNI code have to be passed in as
parameters. That also means it's not possible to store the timeouts in
the DomainSocket fields from within the JNI code. Most of the JNI
methods should be instance methods rather than static ones, but making
that change would require some significant surgery to DomainSocket.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 30/09/2015 17:23, Chris Nauroth wrote:
> I think file descriptor sharing is a capability of Unix domain
> sockets only, and not INET sockets.
Yes, that's correct.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Chris Nauroth <cn...@hortonworks.com>.
That's an interesting find, though I don't think we'd be able to swap in
INET sockets in this part of the code. We use Unix domain sockets to
share an open file descriptor from the DataNode process to the HDFS client
process, and then the client reads directly from that open file
descriptor. I think file descriptor sharing is a capability of Unix
domain sockets only, and not INET sockets. As you said, I wouldn't expect
throughput on the Unix domain socket to be a bottleneck, because there is
very little data transferred.
--Chris Nauroth
On 9/30/15, 9:12 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>On 30/09/2015 16:56, Chris Nauroth wrote:
>
>> Alan, I also meant to say that I didn't understand the comment about "in
>> production it seems that DomainSocket is less commonly used". The
>>current
>> implementation of short-circuit read definitely utilizes DomainSocket,
>>and
>> it's very common to enable this in production clusters. The
>>documentation
>> page you mentioned includes discussion of a legacy short-circuit read
>> implementation, which did not utilize UNIX domain sockets, but the
>>legacy
>> implementation is rarely used in practice now.
>
>Oh, OK - thanks for the clarification. I couldn't find much about
>DomainSocket other than the link I posted and that didn't make it sound
>like it was used all that much. I'll make sure the JIRA reflects what
>you said above.
>
>Interestingly, INET sockets are faster than UNIX sockets on Linux as
>well as on Solaris. There's not much in it, around 10% in both cases,
>and I suspect socket throughput isn't the rate-limiting step anyway.
>
>--
>Alan Burlison
>--
>
Re: DomainSocket issues on Solaris
Posted by Alan Burlison <Al...@oracle.com>.
On 30/09/2015 16:56, Chris Nauroth wrote:
> Alan, I also meant to say that I didn't understand the comment about "in
> production it seems that DomainSocket is less commonly used". The current
> implementation of short-circuit read definitely utilizes DomainSocket, and
> it's very common to enable this in production clusters. The documentation
> page you mentioned includes discussion of a legacy short-circuit read
> implementation, which did not utilize UNIX domain sockets, but the legacy
> implementation is rarely used in practice now.
Oh, OK - thanks for the clarification. I couldn't find much about
DomainSocket other than the link I posted and that didn't make it sound
like it was used all that much. I'll make sure the JIRA reflects what
you said above.
Interestingly, INET sockets are faster than UNIX sockets on Linux as
well as on Solaris. There's not much in it, around 10% in both cases,
and I suspect socket throughput isn't the rate-limiting step anyway.
--
Alan Burlison
--
Re: DomainSocket issues on Solaris
Posted by Chris Nauroth <cn...@hortonworks.com>.
Alan, I also meant to say that I didn't understand the comment about "in
production it seems that DomainSocket is less commonly used". The current
implementation of short-circuit read definitely utilizes DomainSocket, and
it's very common to enable this in production clusters. The documentation
page you mentioned includes discussion of a legacy short-circuit read
implementation, which did not utilize UNIX domain sockets, but the legacy
implementation is rarely used in practice now.
--Chris Nauroth
On 9/30/15, 8:46 AM, "Chris Nauroth" <cn...@hortonworks.com> wrote:
>Hello Alan,
>
>I think this sounds like a reasonable approach. I recommend that you file
>a JIRA with the proposal (copy-paste the content of your email into a
>comment) and then wait a few days before starting work in earnest to see
>if anyone else wants to discuss it first. I also recommend notifying
>Colin Patrick McCabe on that JIRA. It would be good to get a second
>opinion from him, since he is the original author of much of this code.
>
>--Chris Nauroth
>
>
>
>
>On 9/30/15, 1:14 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>
>>Now that the Hadoop native code builds on Solaris I've been chipping
>>away at all the test failures. About 50% of the failures involve
>>DomainSocket, either directly or indirectly. That seems to be mainly
>>because the tests use DomainSocket to do single-node testing, whereas in
>>production it seems that DomainSocket is less commonly used
>>(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html).
>>
>>The particular problem on Solaris is that socket read/write timeouts
>>(the SO_SNDTIMEO and SO_RCVTIMEO socket options) are not supported for
>>UNIX domain (PF_UNIX) sockets. Those options are however supported for
>>PF_INET sockets. That's because the socket implementation on Solaris is
>>split roughly into two parts, for inet sockets and for STREAMS sockets,
>>and the STREAMS implementation lacks support for SO_SNDTIMEO and
>>SO_RCVTIMEO. As an aside, performance of sockets that use loopback or
>>the host's own IP is slightly better than that of UNIX domain sockets on
>>Solaris.
>>
>>I'm investigating getting timeouts supported for PF_UNIX sockets added
>>to Solaris, but in the meantime I'm also looking how this might be
>>worked around in Hadoop. One way would be to implement timeouts by
>>wrapping all the read/write/send/recv etc calls in DomainSocket.c with
>>either poll() or select().
>>
>>The basic idea is to add two new fields to DomainSocket.c to hold the
>>read/write timeouts. On platforms that support SO_SNDTIMEO and
>>SO_RCVTIMEO these would be unused as setsockopt() would be used to set
>>the socket timeouts. On platforms such as Solaris the JNI code would use
>>the values to implement the timeouts appropriately.
>>
>>To prevent the code in DomainSocket.c becoming a #ifdef hairball, the
>>current socket IO function calls such as accept(), send(), read() etc
>>would be replaced with macros such as HD_ACCEPT. On platforms that
>>provide timeouts these would just expand to the normal socket functions;
>>on platforms that don't support timeouts they would expand to wrappers
>>that implement timeouts for them.
>>
>>The only caveats are that all code that does anything to a PF_UNIX
>>socket would *always* have to do so via DomainSocket. As far as I can
>>tell that's not an issue, but it would have to be borne in mind if any
>>changes were made in this area.
>>
>>Before I set about doing this, does the approach seem reasonable?
>>
>>Thanks,
>>
>>--
>>Alan Burlison
>>--
>>
>
Re: DomainSocket issues on Solaris
Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Alan,
I think this sounds like a reasonable approach. I recommend that you file
a JIRA with the proposal (copy-paste the content of your email into a
comment) and then wait a few days before starting work in earnest to see
if anyone else wants to discuss it first. I also recommend notifying
Colin Patrick McCabe on that JIRA. It would be good to get a second
opinion from him, since he is the original author of much of this code.
--Chris Nauroth
On 9/30/15, 1:14 AM, "Alan Burlison" <Al...@oracle.com> wrote:
>Now that the Hadoop native code builds on Solaris I've been chipping
>away at all the test failures. About 50% of the failures involve
>DomainSocket, either directly or indirectly. That seems to be mainly
>because the tests use DomainSocket to do single-node testing, whereas in
>production it seems that DomainSocket is less commonly used
>(https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html).
>
>The particular problem on Solaris is that socket read/write timeouts
>(the SO_SNDTIMEO and SO_RCVTIMEO socket options) are not supported for
>UNIX domain (PF_UNIX) sockets. Those options are however supported for
>PF_INET sockets. That's because the socket implementation on Solaris is
>split roughly into two parts, for inet sockets and for STREAMS sockets,
>and the STREAMS implementation lacks support for SO_SNDTIMEO and
>SO_RCVTIMEO. As an aside, performance of sockets that use loopback or
>the host's own IP is slightly better than that of UNIX domain sockets on
>Solaris.
>
>I'm investigating getting timeouts supported for PF_UNIX sockets added
>to Solaris, but in the meantime I'm also looking how this might be
>worked around in Hadoop. One way would be to implement timeouts by
>wrapping all the read/write/send/recv etc calls in DomainSocket.c with
>either poll() or select().
>
>The basic idea is to add two new fields to DomainSocket.c to hold the
>read/write timeouts. On platforms that support SO_SNDTIMEO and
>SO_RCVTIMEO these would be unused as setsockopt() would be used to set
>the socket timeouts. On platforms such as Solaris the JNI code would use
>the values to implement the timeouts appropriately.
>
>To prevent the code in DomainSocket.c becoming a #ifdef hairball, the
>current socket IO function calls such as accept(), send(), read() etc
>would be replaced with macros such as HD_ACCEPT. On platforms that
>provide timeouts these would just expand to the normal socket functions;
>on platforms that don't support timeouts they would expand to wrappers
>that implement timeouts for them.
>
>The only caveats are that all code that does anything to a PF_UNIX
>socket would *always* have to do so via DomainSocket. As far as I can
>tell that's not an issue, but it would have to be borne in mind if any
>changes were made in this area.
>
>Before I set about doing this, does the approach seem reasonable?
>
>Thanks,
>
>--
>Alan Burlison
>--
>