You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@geode.apache.org by "moleske (GitHub)" <gi...@apache.org> on 2019/01/31 01:09:14 UTC

[GitHub] [geode] moleske opened pull request #3142: GEODE-6334: Change cache operation stats to longs

Co-authored-by: Michael Oleske <mo...@pivotal.io>

Thank you for submitting a contribution to Apache Geode.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

### For all changes:
- [x] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?

- [x] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?

- [x] Is your initial contribution a single, squashed commit?

- [x] Does `gradlew build` run cleanly?

- [x] Have you written or updated unit tests to verify your changes?

- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?

### Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and
submit an update to your PR as soon as possible. If you need help, please send an
email to dev@geode.apache.org.


[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] kirklund commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "kirklund (GitHub)" <gi...@apache.org>.
This PR is showing the biggest shortcoming of StressNewTest. Nearly every IntegrationTest ever added to Geode was written with the assumption that the JVM will never be reused so they don't perform sufficient tearDown. This means that if you touch any of these older tests, you're guaranteed to see a ton of nasty failures in StressNewTest. I would ignore these failures for this PR as long as the regular IntegrationTest jobs are clean

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] kirklund commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "kirklund (GitHub)" <gi...@apache.org>.
Three tests failing in StressTest due to insufficient tearDown in older IntegrationTests:
* GEODE-6385: TXJUnitTest - CacheExistsException
* GEODE-6386: PartitionedRegionSingleNodeOperationsJUnitTest - CacheClosedException
* GEODE-6387: MemberLevelStatsJUnitTest - IllegalStateException: A connection to a distributed system already exists in this VM

One test hanging in StressTest:
* GEODE-6388: PartitionedRegionCqQueryDUnitTest - CqQueryTestListener.waitForTotalEvents

All of the above problems are caused by tests that cannot be executed more than once within a JVM.

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] moleske commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "moleske (GitHub)" <gi...@apache.org>.
Cool thanks for the context @kirklund!  Looks like the unit test failure is related to the red CI stuff going on.

Still going to wait til the main pipeline is green before merging, looks like the [mailing list](https://markmail.org/search/?q=list%3Aorg.apache.geode.dev+order%3Adate-backward#query:list%3Aorg.apache.geode.dev%20order%3Adate-backward+page:1+mid:2dxqmr7vt6jocawm+state:results) is still working on some redness

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] kirklund commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "kirklund (GitHub)" <gi...@apache.org>.
This PR is showing the biggest shortcoming of StressNewTest. Nearly every IntegrationTest ever added to Geode was written with the assumption that the JVM will never be reused so they don't perform sufficient tearDown. This means that if you touch any of these older tests, you're guaranteed to see a ton of nasty failures in StressNewTest. I would ignore these failures for this PR as long as the regular IntegrationTest jobs are clean.

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] kirklund commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "kirklund (GitHub)" <gi...@apache.org>.
All of the StressTest failures are these three bugs:
* GEODE-6385: TXJUnitTest - CacheExistsException
* GEODE-6386: PartitionedRegionSingleNodeOperationsJUnitTest - CacheClosedException
* GEODE-6387: MemberLevelStatsJUnitTest - IllegalStateException: A connection to a distributed system already exists in this VM

And one hang:
* GEODE-6388: PartitionedRegionCqQueryDUnitTest - CqQueryTestListener.waitForTotalEvents

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] kirklund commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "kirklund (GitHub)" <gi...@apache.org>.
All of the StressTest failures are these three bugs:
* GEODE-6385: TXJUnitTest - CacheExistsException
* GEODE-6386: PartitionedRegionSingleNodeOperationsJUnitTest - CacheClosedException
* GEODE-6387: MemberLevelStatsJUnitTest - IllegalStateException: A connection to a distributed system already exists in this VM
And one hang:
* GEODE-6388: PartitionedRegionCqQueryDUnitTest - CqQueryTestListener.waitForTotalEvents

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] moleske commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "moleske (GitHub)" <gi...@apache.org>.
@mhansonp @dhemery please take a look also

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] demery-pivotal commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "demery-pivotal (GitHub)" <gi...@apache.org>.
Would it be unreasonable to fix the fundamental stats code so that no counter wraps to negative, but instead wraps (or perhaps resets) to zero on overflow?

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] moleske commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "moleske (GitHub)" <gi...@apache.org>.
Looks like there are some `int` versus `long` checks that didn't get updated, will have to take a pass at it

@kirklund 

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] kirklund commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "kirklund (GitHub)" <gi...@apache.org>.
I'm not sure if there are stats that have legitimately negative values or not. I can't think of any but there might be some. 

If we change the underlying stats framework to not allow negative values, we have two possible approaches: 1) have the code throw if a call attempts to drop the value below zero, 2) have the code ignore and decrements when the value is at zero.

For #1, I'm worried that we would create new failures that would bring down servers when the stats used to go negative.

For #2, I'm a little worried that we would suppress our ability to notice that code decrementing a stat is buggy. I don't think we want to log every time we try to dec below zero to avoid suppressing this.

A bigger problem that Michael and I noticed is that there's no type enforcement. In other words if the stat is defined as a LongCounter but some code incs or decs it as an Integer, nothing complains or fails. The ThreadLocal used for striping values has two arrays: one for long values and one for integer values. No matter which Counter type, incInt or incLong will happily update the value in corresponding array. So if a stat has some bug in which most of the code paths invoke incLong but one path still invokes incInt, it'll fail to update the proper underlying array and the call to incInt is affectively ignored for the stat. The reason this isn't easy to fix is that there's a single class (Atomic50StatisticsImpl) for all stats in a given *Stats class which may contain a mix of both long and integer stats. I think fixing this problem would be a lot of work redoing the stats framework.

I wonder if it would be better to start migrating ALL stats implementations to using Micrometer instead of reworking our existing framework. (both for dec'ing stats below zero and long vs integer type safety)

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] demery-pivotal commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "demery-pivotal (GitHub)" <gi...@apache.org>.
Okay. Given all that, I'd say preventing negative counters is way beyond the scope of this commit.

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] kirklund closed pull request #3142: GEODE-6334: Change cache operation stats to longs

Posted by "kirklund (GitHub)" <gi...@apache.org>.
[ pull request closed by kirklund ]

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org

[GitHub] [geode] kirklund commented on issue #3142: GEODE-6334: Change cache operation stats to longs

Posted by "kirklund (GitHub)" <gi...@apache.org>.
All of the StressTest failures are these three bugs:
* GEODE-6385: TXJUnitTest - CacheExistsException
* GEODE-6386: PartitionedRegionSingleNodeOperationsJUnitTest - CacheClosedException
* GEODE-6387: MemberLevelStatsJUnitTest - IllegalStateException: A connection to a distributed system already exists in this VM

[ Full content available at: https://github.com/apache/geode/pull/3142 ]
This message was relayed via gitbox.apache.org for notifications@geode.apache.org