You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2022/03/02 11:12:17 UTC

[GitHub] [pulsar] nicoloboschi opened a new pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

nicoloboschi opened a new pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537


   ### Motivation
   After https://github.com/apache/pulsar/pull/14340 and https://github.com/apache/pulsar/pull/14252 you may find in the log something like that
   ```
   2022-03-02T10:49:18,037+0000 [TestNG-method=testBrokerHostUsage-1] ERROR org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl - Failed to read speed for nic br-aca0e958a490, maybe you can set broker config [loadBalancerOverrideBrokerNicSpeedGbps] to override it.
   java.io.IOException: Invalid argument
   	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_292]
   	at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46) ~[?:1.8.0_292]
   	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_292]
   	at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[?:1.8.0_292]
   	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:159) ~[?:1.8.0_292]
   	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:1.8.0_292]
   	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:1.8.0_292]
   	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:1.8.0_292]
   	at java.nio.file.Files.read(Files.java:3105) ~[?:1.8.0_292]
   	at java.nio.file.Files.readAllBytes(Files.java:3158) ~[?:1.8.0_292]
   	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.lambda$null$3(LinuxBrokerHostUsageImpl.java:257) ~[classes/:?]
   	at java.util.stream.ReferencePipeline$6$1.accept(ReferencePipeline.java:244) ~[?:1.8.0_292]
   	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) ~[?:1.8.0_292]
   	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_292]
   	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_292]
   	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_292]
   	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_292]
   	at java.util.stream.DoublePipeline.collect(DoublePipeline.java:500) ~[?:1.8.0_292]
   	at java.util.stream.DoublePipeline.sum(DoublePipeline.java:411) ~[?:1.8.0_292]
   	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.lambda$getTotalNicLimitKbps$4(LinuxBrokerHostUsageImpl.java:263) ~[classes/:?]
   	at java.util.Optional.orElseGet(Optional.java:267) [?:1.8.0_292]
   	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.getTotalNicLimitKbps(LinuxBrokerHostUsageImpl.java:253) [classes/:?]
   	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.calculateBrokerHostUsage(LinuxBrokerHostUsageImpl.java:104) [classes/:?]
   	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.<init>(LinuxBrokerHostUsageImpl.java:90) [classes/:?]
   	at org.apache.pulsar.broker.loadbalance.impl.LinuxBrokerHostUsageImpl.<init>(LinuxBrokerHostUsageImpl.java:66) [classes/:?]
   	at org.apache.pulsar.broker.loadbalance.SimpleLoadManagerImplTest.testBrokerHostUsage(SimpleLoadManagerImplTest.java:440) [test-classes/:?]
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
   	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
   	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132) [testng-7.3.0.jar:?]
   	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45) [testng-7.3.0.jar:?]
   	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73) [testng-7.3.0.jar:?]
   	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11) [testng-7.3.0.jar:?]
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
   	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
   
   ```
   
   There are some NIC that are of type 1 but that doesn't expose the speed.
   The stacktrace is not useful and it's really scary.
   
   Note that since https://github.com/apache/pulsar/pull/14252 has been cherry-picked to 2.8, 2.9, 2.10 it's recommended to pick this one as well.
   
   ### Modifications
   * Check for `speed` file presence before trying to read it and only log a WARN message
   
   - [x] `no-need-doc` 
    
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on a change in pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#discussion_r817617264



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -294,4 +302,8 @@ private double getTotalNicUsageTxKb(List<String> nics) {
     private static long readLongFromFile(String path) throws IOException {
         return Long.parseLong(new String(Files.readAllBytes(Paths.get(path)), Charsets.UTF_8).trim());
     }
+
+    private static double readDoubleFromPath(Path path) throws IOException {
+        return Double.parseDouble(new String(Files.readAllBytes(path)));

Review comment:
       Double.parseDouble may throw a IllegalArgumentException (NumberFormatExcepton)
   we should wrap it in a IOException




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] mattisonchao commented on pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
mattisonchao commented on pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#issuecomment-1057508822


   > I think we should consider failing during broker start up when `loadBalancerOverrideBrokerNicSpeedGbps` is not configured and the broker cannot determine NIC speed for all NICs (and when load balancing is enabled).
   > 
   > The current design is to frequently log an error while ignoring the NIC. The only real course of action for an operator is to re-configure the broker. It seems like a better course of action to fail on start up so that operators know immediately that they need to fix the configuration instead of finding out some time later when observing the logs.
   
   Another point of view, maybe we could make ``loadBalancerOverrideBrokerNicSpeedGbps `` dynamic to avoid restarting broker?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on a change in pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#discussion_r817617997



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -252,10 +252,18 @@ private double getTotalNicLimitKbps(List<String> nics) {
                 .orElseGet(() -> nics.stream().mapToDouble(nicPath -> {
                     // Nic speed is in Mbits/s, return kbits/s
                     try {
-                        return Double.parseDouble(new String(Files.readAllBytes(getNicSpeedPath(nicPath))));
+                        final Path nicSpeedPath = getNicSpeedPath(nicPath);
+                        return readDoubleFromPath(nicSpeedPath);
                     } catch (IOException e) {
-                        log.error(String.format("Failed to read speed for nic %s, maybe you can set broker"
-                                + " config [loadBalancerOverrideBrokerNicSpeedGbps] to override it.", nicPath), e);
+                        if ("Invalid argument".equals(e.getMessage())) {

Review comment:
       we should add some comments here and add references to when/why/where this could happen




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#issuecomment-1058470330


   @mattisonchao - sure, go ahead. For context, we discussed this issue briefly at the community meeting today and there was consensus that failing on startup is the right behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] michaeljmarshall commented on pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
michaeljmarshall commented on pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#issuecomment-1058524723


   @nicoloboschi - that's a great thing to call out. The solution is to configure the `loadBalancerOverrideBrokerNicSpeedGbps` for the CI env.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] mattisonchao commented on a change in pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
mattisonchao commented on a change in pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#discussion_r817714902



##########
File path: pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
##########
@@ -252,10 +252,22 @@ private double getTotalNicLimitKbps(List<String> nics) {
                 .orElseGet(() -> nics.stream().mapToDouble(nicPath -> {
                     // Nic speed is in Mbits/s, return kbits/s
                     try {
-                        return Double.parseDouble(new String(Files.readAllBytes(getNicSpeedPath(nicPath))));
+                        final Path nicSpeedPath = getNicSpeedPath(nicPath);
+                        return readDoubleFromPath(nicSpeedPath);
                     } catch (IOException e) {
-                        log.error(String.format("Failed to read speed for nic %s, maybe you can set broker"
-                                + " config [loadBalancerOverrideBrokerNicSpeedGbps] to override it.", nicPath), e);
+                        // "speed" is not supported by all the types of NIC
+                        // https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-net
+                        // Linux raises "Invalid argument" message error in case "speed" file doesn't exist or
+                        // is not readable
+                        if (e.getCause() == null && "Invalid argument".equals(e.getMessage())) {
+                            log.warn("Failed to read speed for nic {}, the OS raised a 'Invalid argument' error." +
+                                    " Maybe you can set broker config [loadBalancerOverrideBrokerNicSpeedGbps] " +
+                                    "to override it.", nicPath);
+                        } else {
+                            log.error(String.format("Failed to read speed for nic %s, maybe you can set broker"

Review comment:
       Maybe it's better to print the error message?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] nicoloboschi commented on pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
nicoloboschi commented on pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#issuecomment-1058520567


   @mattisonchao @michaeljmarshall please note that some CI environments could have particular NIC and throwing blocking error will make them fail.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] nicoloboschi closed pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
nicoloboschi closed pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] mattisonchao removed a comment on pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
mattisonchao removed a comment on pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#issuecomment-1057508822


   > I think we should consider failing during broker start up when `loadBalancerOverrideBrokerNicSpeedGbps` is not configured and the broker cannot determine NIC speed for all NICs (and when load balancing is enabled).
   > 
   > The current design is to frequently log an error while ignoring the NIC. The only real course of action for an operator is to re-configure the broker. It seems like a better course of action to fail on start up so that operators know immediately that they need to fix the configuration instead of finding out some time later when observing the logs.
   
   Another point of view, maybe we could make ``loadBalancerOverrideBrokerNicSpeedGbps `` dynamic to avoid restarting broker?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] nicoloboschi commented on pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
nicoloboschi commented on pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#issuecomment-1056807149


   @mattisonchao @merlimat @michaeljmarshall  PTAL 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] nicoloboschi commented on pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
nicoloboschi commented on pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#issuecomment-1057714486


   @mattisonchao thanks, please go ahead 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] mattisonchao commented on pull request #14537: [loadbalance] remove 'Invalid argument' confusing stacktrace when NIC speed can't be detected

Posted by GitBox <gi...@apache.org>.
mattisonchao commented on pull request #14537:
URL: https://github.com/apache/pulsar/pull/14537#issuecomment-1057524752


   @nicoloboschi  @michaeljmarshall 
   
   I very much agree with Michael, Would you mind thinking about this approach?
   If you don't have time, I'd like to do that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pulsar.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org