Posted to commits@river.apache.org by "Ron Mann (JIRA)" <ji...@apache.org> on 2007/08/10 16:53:42 UTC

[jira] Created: (RIVER-206) Change default load factors from 3 to 1

Change default load factors from 3 to 1
---------------------------------------

                 Key: RIVER-206
                 URL: https://issues.apache.org/jira/browse/RIVER-206
             Project: River
          Issue Type: Improvement
            Reporter: Ron Mann
            Priority: Minor


Bugtraq ID [6355743|http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6355743]
Taken from jini-users mailing list [http://archives.java.sun.com/cgi-bin/wa?A2=ind0511&L=jini-users&F=&S=&P=25095]:

This is a sad horror story about a default value for a load factor in
Mahalo that turned out to halt our software system at regular intervals,
but never in a deterministic way, leading to many lost development
hours, loss of faith and even worse.

In short, what we experienced was that some operations in our software
system (which includes JavaSpaces and various services that perform
operations under a distributed transaction) that should take place in
parallel took place in a serialized manner. We noticed this behavior
only occurred under certain (at that time unknown) conditions. Not only
was throughput harmed, but our assumptions about the maximum time in
which operations should complete no longer held, and things started to
fail. One can argue that this is what distributed systems are all
about, but nevertheless it is something you try to avoid, especially
when all parts seem to function properly.

We were not able to find deadlocks in our code or any other problem
that could cause this behavior. Given the large number of services,
their interactions, the associated thousands of threads over multiple
JVMs, and the fact that you can't freeze-frame time for your system,
this appeared to be a tricky problem to tackle. One of those moments
you really regret having started to develop a distributed application
in the first place.

However, a little voice told me that Mahalo must be involved in all
this trouble. This was in line with my feeling about Mahalo, as I knew
the code a bit (from fitting it into Seven) and recalled Jim Hurley's
remark at the 7th JCM that "Mahalo is the weakest child of the
contributed services", or similar wording.

So I decided to assume there was a bug in Mahalo, and that the only way
to find out was to develop a scenario that could make that bug obvious
and to improve logging a lot (proper tracking of the transactions and
participants involved). Lately I started to develop some scenarios, and
none of them could reproduce a bug or explain what we saw, until I
experimented with transaction participants that are able to 'take their
time' in the prepare method [1]. When using random prepare times of
3-10 seconds, I noticed that the parallelism of Mahalo and the
throughput of a transaction (the time from client commit to completion)
varied and was no direct function of the prepare time. The behavior I
experienced could only be explained if the scheduler of the various
internal tasks was constrained by something. Knowing the code, I
suddenly realized there must be a 'load factor' applied to the thread
pool used for the commit-related tasks. I was rather shocked to find
out that the default was 3.0, and suddenly the mystery became
completely clear to me. Out of the box, Mahalo has a built-in
constraint that can make the system serialize transaction-related
operations when participants really take their time to return.
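
To make the mechanism concrete, below is a minimal sketch of the
thread-creation rule that such a load factor implies. It is a
simplified illustration only, not the actual
com.sun.jini.thread.TaskManager code; the class and field names are
made up for the example.

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Simplified load-factor-governed pool: a new worker thread is
    // created only when the number of outstanding tasks exceeds
    // loadFactor * (current thread count).
    class LoadFactorPool {
        private final int maxThreads;
        private final float loadFactor;   // outstanding tasks allowed per thread
        private final Queue<Runnable> pending = new ArrayDeque<Runnable>();
        private int threads = 0;          // live worker threads
        private int tasks = 0;            // accepted but not yet finished

        LoadFactorPool(int maxThreads, float loadFactor) {
            this.maxThreads = maxThreads;
            this.loadFactor = loadFactor;
        }

        synchronized void add(Runnable task) {
            pending.add(task);
            tasks++;
            // With loadFactor = 3.0, six slow prepare() calls are served
            // by only two threads, i.e. each thread runs roughly three of
            // them back to back; with 1.0 every outstanding task gets its
            // own thread, up to maxThreads.
            if (threads < maxThreads && tasks > threads * loadFactor) {
                threads++;
                new Thread(new Runnable() {
                    public void run() { workLoop(); }
                }).start();
            }
        }

        private void workLoop() {
            while (true) {
                Runnable next;
                synchronized (this) {
                    next = pending.poll();
                    if (next == null) { threads--; return; }
                }
                try {
                    next.run();
                } finally {
                    synchronized (this) { tasks--; }
                }
            }
        }
    }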

So it turned out that Mahalo is a fine service after all, but that one
'freak' ;-) chose a very unfortunate default value for the load factor [2].

Load factors for thread pools (and max limits, to a lesser degree) are
tricky to get right [3]. Therefore, IMHO, high load factors should only
be used when you know for sure you are dealing with bursts of tasks of
guaranteed short duration, and that is really something people should
tune themselves.

Maybe it was stupid of me and I should have read and understood the
Mahalo documentation better. But I would expect any system to use an
out-of-the-box load factor of 1.0 for a thread pool whose tasks are
potentially long running [3], especially for something as delicate as a
transaction manager, which operates as the so-called spider in the web.
It is better to have a system consume too many threads than to
constrain it in a way that leads to problems that are very hard to
track down.

I hope this mail is seen as an RFE for a default load factor of 1.0, to
prevent people from running into problems similar to ours, and as a
lesson/warning for those working with Mahalo about the risk of using
load factors in general.

[1] In our system some services have to consult external systems when
prepare is called on them, and under some conditions it can take a long
time to return from the prepare method. We are aware this is something
you want to prevent, but we have requirements that mandate it.
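
As a rough illustration of the kind of participant meant here, the
sketch below is written against the standard
net.jini.core.transaction.server interfaces. The class, the external
call and the 3-10 second delay are assumptions for the example, and
exporting the participant and joining the transaction are omitted.

    import java.rmi.RemoteException;
    import net.jini.core.transaction.UnknownTransactionException;
    import net.jini.core.transaction.server.TransactionConstants;
    import net.jini.core.transaction.server.TransactionManager;
    import net.jini.core.transaction.server.TransactionParticipant;

    // Hypothetical participant whose prepare() blocks on an external
    // system before voting; several of these, driven by one Mahalo task
    // pool with a load factor of 3, end up running largely one after
    // another instead of in parallel.
    public class SlowParticipant implements TransactionParticipant {

        public int prepare(TransactionManager mgr, long id)
                throws UnknownTransactionException, RemoteException {
            consultExternalSystem(id);          // may take seconds
            return TransactionConstants.PREPARED;
        }

        public void commit(TransactionManager mgr, long id)
                throws UnknownTransactionException, RemoteException {
            // apply the changes prepared for this transaction
        }

        public void abort(TransactionManager mgr, long id)
                throws UnknownTransactionException, RemoteException {
            // discard the changes prepared for this transaction
        }

        public int prepareAndCommit(TransactionManager mgr, long id)
                throws UnknownTransactionException, RemoteException {
            int vote = prepare(mgr, id);
            if (vote == TransactionConstants.PREPARED) {
                commit(mgr, id);
                return TransactionConstants.COMMITTED;
            }
            return vote;
        }

        private void consultExternalSystem(long id) {
            try {
                // stand-in for the external call: random 3-10 s delay,
                // as in the experiment described above
                Thread.sleep(3000L + (long) (Math.random() * 7000));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }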

[2] The one that gave us problems in production was the Mahalo from
JTSK 2.0, which didn't have the ability to specify a task pool through
the configuration. The load factor of 3.0 was hardcoded (with a TODO)
and, if I recall correctly, not documented at that time (I don't have a
2.0 distribution at hand).

[3] More and more I'm starting to believe that each task in a thread
pool should have a deadline by which it should be assigned to a worker
thread. For this purpose our thread pools support a priority constraint
that can be attached to Runnables; see
http://www.cheiron.org/utils/nightly/api/org/cheiron/util/thread/PriorityConstraints.html.
In a discussion on the Porter mailing list Bob Scheifler once said
"I have in a past life been a fan of deadline scheduling." I'm very
interested to know whether he still is a fan.
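
The deadline idea can be illustrated with a generic sketch (this is not
the Cheiron PriorityConstraints API, just the concept): tasks carry an
absolute deadline, workers always pick the earliest-deadline task, and
a pool monitor can grow the pool whenever the head of the queue is
about to miss its deadline.

    import java.util.concurrent.PriorityBlockingQueue;

    // A Runnable tagged with the time by which it should be handed to
    // a worker thread; ordering is earliest deadline first.
    class DeadlineTask implements Runnable, Comparable<DeadlineTask> {
        final long deadlineMillis;
        final Runnable body;

        DeadlineTask(long deadlineMillis, Runnable body) {
            this.deadlineMillis = deadlineMillis;
            this.body = body;
        }

        public int compareTo(DeadlineTask other) {
            return Long.compare(deadlineMillis, other.deadlineMillis);
        }

        public void run() { body.run(); }
    }

    class DeadlineQueue {
        private final PriorityBlockingQueue<DeadlineTask> queue =
                new PriorityBlockingQueue<DeadlineTask>();

        void submit(DeadlineTask task) { queue.put(task); }

        // Deadline of the most urgent pending task, or null if idle; a
        // monitor thread can use this to decide when to add a worker.
        Long nextDeadline() {
            DeadlineTask head = queue.peek();
            return (head == null) ? null : Long.valueOf(head.deadlineMillis);
        }

        // Body of each worker thread.
        void workerLoop() throws InterruptedException {
            while (true) {
                queue.take().run();
            }
        }
    }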

Evaluation:
Given a low priority since, as of 2.1, the task pool objects are user
configurable; this request is to change the default setting for those
objects.
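
For deployments that cannot wait for a new default, a 2.1-style
configuration override is sketched below. The com.sun.jini.mahalo
component block and the com.sun.jini.thread.TaskManager constructor
(max threads, idle timeout in ms, load factor) are real, but the entry
name "taskPool" is an assumption; check the configuration entries
documented for your Mahalo release.

    /* hypothetical fragment of a Mahalo configuration file */
    com.sun.jini.mahalo {
        /* maxThreads, idle timeout (ms), load factor */
        taskPool = new com.sun.jini.thread.TaskManager(50, 15000L, 1.0f);
    }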


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (RIVER-206) Change default load factors from 3 to 1

Posted by "Robert Resendes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590404#action_12590404 ] 

Robert Resendes commented on RIVER-206:
---------------------------------------

Accepting and starting work on this issue.



Re: [jira] Updated: (RIVER-206) Change default load factors from 3 to 1

Posted by Mark Brouwer <ma...@cheiron.org>.
Robert Resendes (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Robert Resendes updated RIVER-206:
> ----------------------------------
> 
>     Attachment: RIVER-206.diff

OK, a long time for such a small but oh so important change ;-)
-- 
Mark

[jira] Updated: (RIVER-206) Change default load factors from 3 to 1

Posted by "Robert Resendes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Resendes updated RIVER-206:
----------------------------------

    Attachment: RIVER-206.diff

Proposed diff-formatted change attached



[jira] Updated: (RIVER-206) Change default load factors from 3 to 1

Posted by "Robert Resendes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Resendes updated RIVER-206:
----------------------------------

    Fix Version/s: AR2



[jira] Closed: (RIVER-206) Change default load factors from 3 to 1

Posted by "Peter Firmstone (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Firmstone closed RIVER-206.
---------------------------------




[jira] Resolved: (RIVER-206) Change default load factors from 3 to 1

Posted by "Robert Resendes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Resendes resolved RIVER-206.
-----------------------------------

    Resolution: Fixed

Committed as revision 650114



[jira] Updated: (RIVER-206) Change default load factors from 3 to 1

Posted by "Robert Resendes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Resendes updated RIVER-206:
----------------------------------

    Assignee: Robert Resendes



[jira] Work started: (RIVER-206) Change default load factors from 3 to 1

Posted by "Robert Resendes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/RIVER-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on RIVER-206 started by Robert Resendes.
