You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by pnowojski <gi...@git.apache.org> on 2017/10/11 12:54:21 UTC

[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

GitHub user pnowojski opened a pull request:

    https://github.com/apache/flink/pull/4801

    [FLINK-7812] Log system resources metrics

    ## What is the purpose of the change
    
    This PR adds various system resources metrics, useful for analysing issues on machines/clusters for which there are no detailed external resources logging systems.
    
    ## Verifying this change
    
    This change was manually tested, since it's difficult to write automated tests for this feature.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (**yes** / no)
    
    It adds `ohci` dependency.
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (**yes** / no)
      - If yes, how is the feature documented? (not applicable / **docs** / **JavaDocs** / not documented)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pnowojski/flink resources

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4801.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4801
    
----
commit d337e322477486daad19fad37747e6a8898bf619
Author: Piotr Nowojski <pi...@gmail.com>
Date:   2017-10-10T15:24:21Z

    [hotfix][metrics] Replace anonymous classes with lamdas

commit 37eaf16bb315be35559d8ac7f5a98c38cfc1e9a4
Author: Piotr Nowojski <pi...@gmail.com>
Date:   2017-10-11T10:06:01Z

    [hotfix][metrics] Remove redundant TaskManager metrics initialization

commit 1b2420746a613a2af73486197d4fa3aeb63f7236
Author: Piotr Nowojski <pi...@gmail.com>
Date:   2017-10-11T07:10:49Z

    [FLINK-7812][metrics] Add system resources metrics

----


---

[GitHub] flink issue #4801: [FLINK-7812] Log system resources metrics

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4801
  
    Thanks for this addition. Few comments:
    
      - Please try to follow the common (though not enforced) code style when it comes to empty lines between class declarations, fields, methods, etc.
    
      - What is the license of the new dependency? Can you send a link so we can verify that?
    
      - If possible, let's shade that dependency, avoid increasing the dependency footprint.
    
      - Can we harmonize the config keys between the memory logger and the new system metrics?


---

[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4801#discussion_r155503385
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/utils/SystemResourcesCounter.java ---
    @@ -0,0 +1,236 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.runtime.taskexecutor.utils;
    +
    +import org.apache.flink.api.common.time.Time;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import oshi.SystemInfo;
    +import oshi.hardware.CentralProcessor;
    +import oshi.hardware.CentralProcessor.TickType;
    +import oshi.hardware.HardwareAbstractionLayer;
    +import oshi.hardware.NetworkIF;
    +
    +import javax.annotation.concurrent.ThreadSafe;
    +
    +import java.util.concurrent.atomic.AtomicLongArray;
    +import java.util.concurrent.atomic.AtomicReferenceArray;
    +
    +import static org.apache.flink.util.Preconditions.checkState;
    +
    +/**
    + * Daemon thread logging system resources.
    + *
    + * To accurately and consistently report CPU and network usage we have to periodically probe
    + * CPU ticks and network sent/received bytes and then convert those values to CPU usage and
    + * send/receive byte rates.
    + */
    +@ThreadSafe
    --- End diff --
    
    A Thread is ThreadSafe? ;-)


---

[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4801#discussion_r156010369
  
    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunnerITCase.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.runtime.taskexecutor;
    +
    +import org.apache.flink.configuration.Configuration;
    +import org.apache.flink.metrics.Gauge;
    +import org.apache.flink.metrics.MetricConfig;
    +import org.apache.flink.metrics.reporter.AbstractReporter;
    +import org.apache.flink.runtime.minicluster.MiniCluster;
    +import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;
    +
    +import org.junit.AfterClass;
    +import org.junit.BeforeClass;
    +import org.junit.Test;
    +
    +import java.util.Collection;
    +import java.util.Map;
    +import java.util.Set;
    +import java.util.concurrent.ConcurrentHashMap;
    +
    +import static org.apache.flink.configuration.MetricOptions.REPORTERS_LIST;
    +import static org.apache.flink.configuration.TaskManagerOptions.ADDITIONAL_LOGGING;
    +import static org.junit.Assert.assertEquals;
    +import static org.junit.Assert.fail;
    +
    +/**
    + * Integration tests for proper initialization in the {@link TaskManagerRunner}.
    + */
    +public class TaskManagerRunnerITCase {
    --- End diff --
    
    Renamed to `SystemResourcesMetricsITCase`.
    
    I have added this test after some larger rebasing conflicts, because otherwise I wasn't sure that everything is starting up correctly. I think that there would have to be a handful unit tests replacing this one `ITCase` and they would be more prone to fail/cause problems during refactoring.


---

[GitHub] flink issue #4801: [FLINK-7812] Log system resources metrics

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4801
  
    Eclipse Public License is not impossible, but tricky.
    
    I am not a lawyer, but this is what I picked up over the year: EPL is weak copyleft, meaning linking is okay, but modifying not (from Apache License compatibility) . Shading the code in the library (which we do when building the flink dist jar) is a bit of an gray zone. It does not violate the spirit of the license, but a court may see that differently.
    
    Various Flink users that approached us to avoid weak copyleft as much as possible because of that uncertainty, so avoiding this dependency would be desirable.
    
    Making it an optional dependency that users explicitly have to add is possible, because then we do shade it into the Flink distribution jar.


---

[GitHub] flink issue #4801: [FLINK-7812] Log system resources metrics

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on the issue:

    https://github.com/apache/flink/pull/4801
  
    As on dev mailing list discussion, this feature uses https://github.com/oshi/oshi library licensed under EPL 1.0. It seems to be compatible with ours: https://www.apache.org/legal/resolved.html .
    
    It has minimal external dependencies. Question is whether we want to shade everything that we add?
    
    Definitely we could unify config options if we want to do that.


---

[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4801#discussion_r156008613
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/utils/SystemResourcesCounter.java ---
    @@ -0,0 +1,236 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.runtime.taskexecutor.utils;
    +
    +import org.apache.flink.api.common.time.Time;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import oshi.SystemInfo;
    +import oshi.hardware.CentralProcessor;
    +import oshi.hardware.CentralProcessor.TickType;
    +import oshi.hardware.HardwareAbstractionLayer;
    +import oshi.hardware.NetworkIF;
    +
    +import javax.annotation.concurrent.ThreadSafe;
    +
    +import java.util.concurrent.atomic.AtomicLongArray;
    +import java.util.concurrent.atomic.AtomicReferenceArray;
    +
    +import static org.apache.flink.util.Preconditions.checkState;
    +
    +/**
    + * Daemon thread logging system resources.
    + *
    + * To accurately and consistently report CPU and network usage we have to periodically probe
    + * CPU ticks and network sent/received bytes and then convert those values to CPU usage and
    + * send/receive byte rates.
    + */
    +@ThreadSafe
    --- End diff --
    
    Hmmm :) This annotation is about public getters, but maybe indeed it doesn't make a lot of sense. On the other hand, is it harmful?


---

[GitHub] flink issue #4801: [FLINK-7812] Log system resources metrics

Posted by pnowojski <gi...@git.apache.org>.
Github user pnowojski commented on the issue:

    https://github.com/apache/flink/pull/4801
  
    I have made this dependency optional.


---

[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4801#discussion_r155501129
  
    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunnerITCase.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.runtime.taskexecutor;
    +
    +import org.apache.flink.configuration.Configuration;
    +import org.apache.flink.metrics.Gauge;
    +import org.apache.flink.metrics.MetricConfig;
    +import org.apache.flink.metrics.reporter.AbstractReporter;
    +import org.apache.flink.runtime.minicluster.MiniCluster;
    +import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;
    +
    +import org.junit.AfterClass;
    +import org.junit.BeforeClass;
    +import org.junit.Test;
    +
    +import java.util.Collection;
    +import java.util.Map;
    +import java.util.Set;
    +import java.util.concurrent.ConcurrentHashMap;
    +
    +import static org.apache.flink.configuration.MetricOptions.REPORTERS_LIST;
    +import static org.apache.flink.configuration.TaskManagerOptions.ADDITIONAL_LOGGING;
    +import static org.junit.Assert.assertEquals;
    +import static org.junit.Assert.fail;
    +
    +/**
    + * Integration tests for proper initialization in the {@link TaskManagerRunner}.
    + */
    +public class TaskManagerRunnerITCase {
    --- End diff --
    
    This test seems very specific to this logging, but is named as a generic TaskManagerRunner test. Give it a differnet name?
    
    Separate question: Does it have to be an IT case that fully starts the TM, or can it be a unit test that checks config propagation?


---

[GitHub] flink pull request #4801: [FLINK-7812] Log system resources metrics

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4801#discussion_r155503740
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/metrics/util/MetricUtils.java ---
    @@ -102,37 +103,16 @@ public static void instantiateStatusMetrics(
     	private static void instantiateNetworkMetrics(
     		MetricGroup metrics,
     		final NetworkEnvironment network) {
    -		metrics.<Long, Gauge<Long>>gauge("TotalMemorySegments", new Gauge<Long> () {
    -			@Override
    -			public Long getValue() {
    -				return (long) network.getNetworkBufferPool().getTotalNumberOfMemorySegments();
    -			}
    -		});
     
    -		metrics.<Long, Gauge<Long>>gauge("AvailableMemorySegments", new Gauge<Long> () {
    -			@Override
    -			public Long getValue() {
    -				return (long) network.getNetworkBufferPool().getNumberOfAvailableMemorySegments();
    -			}
    -		});
    +		final NetworkBufferPool networkBufferPool = network.getNetworkBufferPool();
    +		metrics.<Long, Gauge<Long>>gauge("TotalMemorySegments", () -> (long) networkBufferPool.getTotalNumberOfMemorySegments());
    --- End diff --
    
    Replace with "Integer" Gauge and change to method reference?


---