Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2022/01/15 14:20:00 UTC

[jira] [Commented] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

    [ https://issues.apache.org/jira/browse/NUTCH-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476615#comment-17476615 ] 

Sebastian Nagel commented on NUTCH-2936:
----------------------------------------

Using protocol-okhttp causes parsechecker to raise the following error (also related to java.security):
{noformat}
$> bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' https://example.com/
...
2022-01-15 15:18:06,016 WARN o.a.n.p.PluginRepository [main] Could not find org.apache.nutch.protocol.okhttp.OkHttp
java.lang.NullPointerException: Parameter specified as non-null is null: method okhttp3.OkHttpClient$Builder.sslSocketFactory, parameter sslSocketFactory
        at okhttp3.OkHttpClient$Builder.sslSocketFactory(OkHttpClient.kt) ~[?:?]
        at org.apache.nutch.protocol.okhttp.OkHttp.setConf(OkHttp.java:129) ~[?:?]
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:175) ~[apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at java.net.URL.getURLStreamHandler(URL.java:1432) [?:?]
        at java.net.URL.<init>(URL.java:651) [?:?]
        at java.net.URL.<init>(URL.java:541) [?:?]
        at java.net.URL.<init>(URL.java:488) [?:?]
        at javax.crypto.JceSecurity.<clinit>(JceSecurity.java:239) [?:?]
        at javax.crypto.Cipher.getInstance(Cipher.java:540) [?:?]
        at sun.security.ssl.JsseJce.getCipher(JsseJce.java:190) [?:?]
        at sun.security.ssl.SSLCipher.isTransformationAvailable(SSLCipher.java:509) [?:?]
        at sun.security.ssl.SSLCipher.<init>(SSLCipher.java:498) [?:?]
        at sun.security.ssl.SSLCipher.<clinit>(SSLCipher.java:81) [?:?]
        at sun.security.ssl.CipherSuite.<clinit>(CipherSuite.java:69) [?:?]
        at sun.security.ssl.SSLContextImpl.getApplicableSupportedCipherSuites(SSLContextImpl.java:348) [?:?]
        at sun.security.ssl.SSLContextImpl$AbstractTLSContext.<clinit>(SSLContextImpl.java:580) [?:?]
        at java.lang.Class.forName0(Native Method) ~[?:?]
        at java.lang.Class.forName(Class.java:315) [?:?]
        at java.security.Provider$Service.getImplClass(Provider.java:1918) [?:?]
        at java.security.Provider$Service.newInstance(Provider.java:1894) [?:?]
        at sun.security.jca.GetInstance.getInstance(GetInstance.java:236) [?:?]
        at sun.security.jca.GetInstance.getInstance(GetInstance.java:164) [?:?]
        at javax.net.ssl.SSLContext.getInstance(SSLContext.java:168) [?:?]
        at org.apache.nutch.protocol.okhttp.OkHttp.<clinit>(OkHttp.java:94) [protocol-okhttp.jar:?]
        at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) [?:?]
        at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [?:?]
        at java.lang.reflect.Constructor.newInstance(Constructor.java:490) [?:?]
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:170) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.nutch.plugin.PluginRepository.createURLStreamHandler(PluginRepository.java:597) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.nutch.plugin.URLStreamHandlerFactory.createURLStreamHandler(URLStreamHandlerFactory.java:95) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at java.net.URL.getURLStreamHandler(URL.java:1432) [?:?]
        at java.net.URL.<init>(URL.java:651) [?:?]
        at java.net.URL.<init>(URL.java:541) [?:?]
        at java.net.URL.<init>(URL.java:488) [?:?]
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:109) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.nutch.util.AbstractChecker.getProtocolOutput(AbstractChecker.java:196) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.nutch.parse.ParserChecker.process(ParserChecker.java:185) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.nutch.util.AbstractChecker.processSingle(AbstractChecker.java:87) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:150) [apache-nutch-1.19-SNAPSHOT.jar:?]
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) [hadoop-common-3.1.3.jar:?]
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:307) [apache-nutch-1.19-SNAPSHOT.jar:?]
{noformat}
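
The trace reads as a re-entrant handler lookup: OkHttp.<clinit> calls SSLContext.getInstance, the class initialization behind it (JceSecurity.<clinit>) constructs another URL, and that URL goes through URLStreamHandlerFactory.createURLStreamHandler again, which instantiates OkHttp a second time, apparently before its static initializer has finished. A minimal sketch of one possible guard follows. It is an illustration only and not necessarily how this issue should be fixed: GuardedHandlerFactory, pluginLookup and demo() are hypothetical names that do not exist in Nutch, and the lookup function merely stands in for the role PluginRepository plays in the trace. It relies on the documented behavior that java.net.URL falls back to its built-in handler when the factory returns null.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.function.Function;

/** Hypothetical guard against re-entrant handler lookups; not Nutch's actual code. */
final class GuardedHandlerFactory implements URLStreamHandlerFactory {

  // Stand-in for the plugin lookup (an assumption, not a Nutch API).
  private final Function<String, URLStreamHandler> pluginLookup;

  // True while the current thread is already inside a plugin lookup.
  private final ThreadLocal<Boolean> inLookup = ThreadLocal.withInitial(() -> false);

  GuardedHandlerFactory(Function<String, URLStreamHandler> pluginLookup) {
    this.pluginLookup = pluginLookup;
  }

  @Override
  public URLStreamHandler createURLStreamHandler(String protocol) {
    if (inLookup.get()) {
      // Re-entrant call: returning null lets java.net.URL use its built-in handler.
      return null;
    }
    inLookup.set(true);
    try {
      return pluginLookup.apply(protocol);
    } finally {
      inLookup.set(false);
    }
  }

  /** Simulates a plugin whose initialization constructs another URL. */
  static String demo() {
    GuardedHandlerFactory[] self = new GuardedHandlerFactory[1];
    StringBuilder log = new StringBuilder();
    self[0] = new GuardedHandlerFactory(protocol -> {
      // Nested lookup on the same thread, analogous to the second
      // createURLStreamHandler frame in the trace above.
      URLStreamHandler nested = self[0].createURLStreamHandler("http");
      log.append("nested=").append(nested == null ? "default" : "plugin");
      return new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
          throw new IOException("demo handler, no connection opened");
        }
      };
    });
    URLStreamHandler outer = self[0].createURLStreamHandler("https");
    log.append(" outer=").append(outer == null ? "default" : "plugin");
    return log.toString();
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints "nested=default outer=plugin"
  }
}
```

The guard is per thread, so it only covers recursion on the thread that is initializing the plugin. Whether falling back to the JVM's default handler is acceptable for the nested URL would need to be checked against the actual plugin code.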

> Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2936
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2936
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> After merging NUTCH-2429 I've observed that Nutch jobs running in distributed mode may fail early with the following dubious error:
> {noformat}
> 2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: java.io.IOException: Error generating shuffle secret key
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
>         at java.base/java.security.AccessController.doPrivileged(Native Method)
>         at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
>         at org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not available
>         at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
>         at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
>         ... 16 more
> {noformat}
> After removing the early registration of URL stream handlers (see NUTCH-2429) in NutchJob and NutchTool, the job starts without errors.
> Notes:
> - the job in which this error was observed is a [custom de-duplication job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java] that flags redirects pointing to the same target URL. But I'll try to reproduce it with a standard Nutch job and in pseudo-distributed mode.
> - we should also verify whether registering URL stream handlers works at all in distributed mode, because tasks are launched differently there, not as NutchJob or NutchTool.


