Posted to user@river.apache.org by Christopher Dolan <ch...@avid.com> on 2011/03/10 22:06:44 UTC

reverse DNS timeouts and SocketPermission

The java.net.SocketPermission class uses forward and reverse DNS lookups
to ensure that we're allowed to talk to particular remote machines.
These lookups are used to canonicalize a remote host's name to ensure
that variations in that name don't lead to false negatives.

However, many people have found
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4975882) that if
there are configuration errors in a DNS system, the reverse DNS failures
cause very significant latency (e.g. I've seen 10-12 seconds). This
latency has widely varying effects on a djinn. In many cases, it just
causes LookupCache slowdowns which can be mitigated by delayed
deserialization techniques discussed previously on the dev@ mailing
list. But in some cases, I've seen it cause Reggie to hang for a
while (I still don't understand where in Reggie the problem occurs,
maybe EventListeners?).

Obviously, the real solution is to properly configure DNS. But I would
like to know how other people have addressed this issue in their
deployments.

 * Do you ensure the RMI codebase URLs all use canonical hostnames, or
IP addresses?
 * Do you ensure that the TcpServerEndpoint has a consistent (perhaps
hard-coded) name?
 * Do you have monitoring or logging code to proactively detect DNS
configuration errors?
 * Do you fiddle the Java security property
"networkaddress.cache.negative.ttl"?
 * Do you use host files?
 * Do you use a non-Sun JVM?
 * Do you use wildcards or IP addresses in your security policy file?
 * Do you completely disable the socket check in your security policy
file? (yikes!)
 * Have you simply never seen this problem?  (lucky you!)
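For the monitoring question above, a minimal sketch of a check that times a forward lookup followed by a reverse lookup for each host. The class name DnsCheck, the Result type, and the 1000 ms threshold are assumptions made for illustration; nothing here comes from River itself. It relies only on java.net.InetAddress, whose getCanonicalHostName() falls back to the textual IP address when the reverse lookup fails.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of River: times forward + reverse DNS for a host.
class DnsCheck {

    /** Outcome of one forward-plus-reverse lookup. */
    static final class Result {
        final String host;
        final String canonicalName; // null if the forward lookup failed
        final long elapsedMillis;

        Result(String host, String canonicalName, long elapsedMillis) {
            this.host = host;
            this.canonicalName = canonicalName;
            this.elapsedMillis = elapsedMillis;
        }
    }

    static Result check(String host) {
        long start = System.nanoTime();
        String canonical = null;
        try {
            // Forward lookup (no DNS query is needed for a literal IP address).
            InetAddress address = InetAddress.getByName(host);
            // Reverse lookup; returns the textual IP if no name can be found.
            canonical = address.getCanonicalHostName();
        } catch (UnknownHostException e) {
            // Forward lookup failed; canonical stays null.
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(host, canonical, elapsedMillis);
    }

    public static void main(String[] args) {
        long thresholdMillis = 1000; // assumed threshold for "suspiciously slow"
        for (String host : args) {
            Result r = check(host);
            String flag = (r.canonicalName == null || r.elapsedMillis > thresholdMillis)
                    ? "SUSPECT" : "ok";
            System.out.println(flag + " " + r.host + " -> " + r.canonicalName
                    + " (" + r.elapsedMillis + " ms)");
        }
    }
}
```

Running it with the hostnames that appear in your codebase URLs and policy file should show which entries stall, before a djinn trips over them.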

Thanks,
Chris

Re: reverse DNS timeouts and SocketPermission

Posted by Gregg Wonderly <ge...@cox.net>.

If you'd like to see how this affects things, you can break the bind
configuration on a server running reggie and watch how repeated
"reverse-DNS" delays affect your service's performance. A simple way is
to remove the contents of /etc/resolv.conf and put in addresses of
servers that don't exist, so that there is nothing bind can find to
"talk" to. Then empty out /etc/hosts as well (except for localhost and
the host machine's name and address) so there's nothing for it to cheat
from in there.

Then, write a test service and client that interact via lookup, followed by
getting a lease, making a remote call from the client to the service, and then
canceling the lease. Run this in a loop indefinitely, and log the times between
calls. When you've got it correctly "broken", you will see many seconds between
each call, indicating huge delays as the internal Java security implementation
tries to perform reverse DNS on the client's inbound socket address.
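The timing part of such a test could look roughly like the sketch below. CallTimer and timeCalls are made-up names, and the Runnable stands in for the real lookup / lease / remote call / cancel sequence, which is not shown because it depends on your service. The loop is bounded here rather than indefinite so that it terminates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timing harness; the Runnable is a placeholder for the real
// lookup -> lease -> remote call -> cancel round trip against your service.
class CallTimer {

    /** Runs roundTrip the given number of times and returns each duration in ms. */
    static List<Long> timeCalls(Runnable roundTrip, int iterations) {
        List<Long> durations = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            roundTrip.run();
            durations.add((System.nanoTime() - start) / 1_000_000);
        }
        return durations;
    }

    public static void main(String[] args) {
        // Placeholder work: sleep briefly instead of calling a real service.
        Runnable fakeRoundTrip = () -> {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        List<Long> durations = timeCalls(fakeRoundTrip, 10);
        for (int i = 0; i < durations.size(); i++) {
            System.out.println("call " + i + ": " + durations.get(i) + " ms");
        }
    }
}
```

With a healthy DNS setup the per-call times should stay small; with the resolver broken as described, multi-second gaps should stand out in the log.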

The policy file contents may also extend the delay, for example when
there are multiple host grants.

What happens on failure is that Java caches the DNS lookup failure for 10 
seconds.  So, 10 seconds later, the DNS lookup will have to be done again 
starting with the failing host.
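A minimal sketch of raising that negative-cache lifetime from code, using the "networkaddress.cache.negative.ttl" security property mentioned earlier in the thread. The class name, the raise() helper, and the value of 300 seconds are assumptions for illustration only. As far as I know the JVM reads this property once, when name resolution is first initialized, so it should be set early in startup (or in the JRE's java.security file instead).

```java
import java.security.Security;

// Hypothetical helper: lengthens how long failed DNS lookups are cached.
class NegativeDnsCacheTtl {

    static final String PROPERTY = "networkaddress.cache.negative.ttl";

    /** Sets the negative-cache TTL in seconds and returns the previous value (may be null). */
    static String raise(int seconds) {
        String previous = Security.getProperty(PROPERTY);
        Security.setProperty(PROPERTY, Integer.toString(seconds));
        return previous;
    }

    public static void main(String[] args) {
        // Assumed value: cache failures for 5 minutes instead of the default 10 seconds.
        String previous = raise(300);
        System.out.println(PROPERTY + ": " + previous + " -> "
                + Security.getProperty(PROPERTY));
    }
}
```

The trade-off is that a host whose DNS entry gets fixed will keep failing until the cached failure expires, so a very large value is not free.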

This is another one of those things which people can experience as a negative 
impact on their Jini "first experience".  They'd look at the response times and 
say, crap, at this speed, I can just use paper and people on bikes to get things 
done faster!

Gregg Wonderly

On 3/15/2011 12:01 PM, Tom Hobbs wrote:
> I've not experienced this issue myself.  It's an interesting one, and
> Gregg's response is also intriguing.
>
> I know it's not that helpful to you, but I'll see what I can do about
> including something about this on the River site or wiki.
>
> Chris, if you feel this is an issue that River can/should solve then
> please create a Jira for it otherwise it'll get lost in the mists of
> time.
>
> On Tue, Mar 15, 2011 at 4:42 PM, Christopher Dolan
> <ch...@avid.com>  wrote:
>> Understood, increasing that value to something large would make me just
>> suffer that timeout once per remote machine per reboot. Is this the
>> solution most River users have employed, or have most of you simply
>> never had to deal with this problem? In my case, I may connect to
>> hundreds of remote machines via an app that wants a short startup time,
>> so this solution concerns me.
>>
>> Chris
>>
>> -----Original Message-----
>> From: Gregg Wonderly [mailto:gergg@cox.net]
>> Sent: Sunday, March 13, 2011 9:08 AM
>> To: user@river.apache.org
>> Subject: Re: reverse DNS timeouts and SocketPermission
>>
>> Dns failure ttl change is the most useful way to deal with this. 10
>> seconds is the default and a failing dns query will be longer than that.
>> So every use of the name will result in a new attempt to lookup the same
>> thing on the same failing server
>>
>> Gregg
>>
>> Sent from my iPhone

Re: reverse DNS timeouts and SocketPermission

Posted by Tom Hobbs <tv...@googlemail.com>.

I've not experienced this issue myself.  It's an interesting one, and
Gregg's response is also intriguing.

I know it's not that helpful to you, but I'll see what I can do about
including something about this on the River site or wiki.

Chris, if you feel this is an issue that River can/should solve, then
please create a Jira for it; otherwise it'll get lost in the mists of
time.

On Tue, Mar 15, 2011 at 4:42 PM, Christopher Dolan
<ch...@avid.com> wrote:
> Understood, increasing that value to something large would make me just
> suffer that timeout once per remote machine per reboot. Is this the
> solution most River users have employed, or have most of you simply
> never had to deal with this problem? In my case, I may connect to
> hundreds of remote machines via an app that wants a short startup time,
> so this solution concerns me.
>
> Chris
>
> -----Original Message-----
> From: Gregg Wonderly [mailto:gergg@cox.net]
> Sent: Sunday, March 13, 2011 9:08 AM
> To: user@river.apache.org
> Subject: Re: reverse DNS timeouts and SocketPermission
>
> Dns failure ttl change is the most useful way to deal with this. 10
> seconds is the default and a failing dns query will be longer than that.
> So every use of the name will result in a new attempt to lookup the same
> thing on the same failing server
>
> Gregg
>
> Sent from my iPhone

RE: reverse DNS timeouts and SocketPermission

Posted by Christopher Dolan <ch...@avid.com>.

Understood. Increasing that value to something large would mean I suffer
that timeout just once per remote machine per reboot. Is this the
solution most River users have employed, or have most of you simply
never had to deal with this problem? In my case, I may connect to
hundreds of remote machines via an app that wants a short startup time,
so this solution concerns me.

Chris

-----Original Message-----
From: Gregg Wonderly [mailto:gergg@cox.net] 
Sent: Sunday, March 13, 2011 9:08 AM
To: user@river.apache.org
Subject: Re: reverse DNS timeouts and SocketPermission

Dns failure ttl change is the most useful way to deal with this. 10
seconds is the default and a failing dns query will be longer than that.
So every use of the name will result in a new attempt to lookup the same
thing on the same failing server

Gregg

Sent from my iPhone


Re: reverse DNS timeouts and SocketPermission

Posted by Gregg Wonderly <ge...@cox.net>.

Changing the DNS failure TTL is the most useful way to deal with this. 10 seconds is the default, and a failing DNS query will take longer than that. So every use of the name will result in a new attempt to look up the same thing on the same failing server.

Gregg

Sent from my iPhone
