Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2022/01/14 14:05:00 UTC
[jira] [Created] (NUTCH-2936) Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode

Sebastian Nagel created NUTCH-2936:
--------------------------------------

             Summary: Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode
                 Key: NUTCH-2936
                 URL: https://issues.apache.org/jira/browse/NUTCH-2936
             Project: Nutch
          Issue Type: Bug
          Components: plugin, protocol
    Affects Versions: 1.19
            Reporter: Sebastian Nagel
             Fix For: 1.19


After merging NUTCH-2429, I've observed that Nutch jobs running in distributed mode may fail early with the following puzzling error:
{noformat}
2022-01-14 13:11:45,751 ERROR crawl.DedupRedirectsJob: DeduplicationJob: java.io.IOException: Error generating shuffle secret key
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:182)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1565)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1562)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1562)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1583)
        at org.apache.nutch.crawl.DedupRedirectsJob.run(DedupRedirectsJob.java:301)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.nutch.crawl.DedupRedirectsJob.main(DedupRedirectsJob.java:379)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Caused by: java.security.NoSuchAlgorithmException: HmacSHA1 KeyGenerator not available
        at java.base/javax.crypto.KeyGenerator.<init>(KeyGenerator.java:177)
        at java.base/javax.crypto.KeyGenerator.getInstance(KeyGenerator.java:244)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:179)
        ... 16 more
{noformat}
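
To narrow this down, the failing call can be tried in isolation. The following is a minimal sketch, assuming that JobSubmitter does little more at this point than call KeyGenerator.getInstance("HmacSHA1"), as the stack trace and exception message suggest. The class and method names are made up for illustration and are not part of Nutch or Hadoop.

```java
import java.security.NoSuchAlgorithmException;
import java.security.Provider;
import java.security.Security;
import javax.crypto.KeyGenerator;

// Hypothetical diagnostic class, not part of Nutch or Hadoop.
class ShuffleKeyCheck {
    // Algorithm name taken from the exception message above.
    static final String ALGORITHM = "HmacSHA1";

    // Returns true if some installed security provider offers a
    // KeyGenerator for the given algorithm.
    static boolean keyGeneratorAvailable(String algorithm) {
        try {
            KeyGenerator.getInstance(algorithm);
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(ALGORITHM + " KeyGenerator available: "
            + keyGeneratorAvailable(ALGORITHM));
        // List the installed providers to see whether the JCE provider is present.
        for (Provider p : Security.getProviders()) {
            System.out.println("  provider: " + p.getName());
        }
    }
}
```

Calling keyGeneratorAvailable() once before and once after the URL stream handlers are registered, in the same JVM, should show whether the registration is what makes the algorithm unavailable.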

After removing the early registration of URL stream handlers (see NUTCH-2429) in NutchJob and NutchTool, the job starts without errors.
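
For context, a JVM-wide registration of URL stream handlers generally looks like the sketch below. This is not the code from NUTCH-2429: the class names and the "foo" protocol are hypothetical. It only illustrates that URL.setURLStreamHandlerFactory() can be called once per JVM and affects every URL created afterwards, and that a factory can return null for protocols it does not own so that the JVM falls back to its built-in handlers. Whether this has any bearing on the HmacSHA1 failure is not established here.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not the Nutch implementation.
class PluginUrlHandlers {
    // Assumed set of plugin-provided protocols; "foo" is a placeholder.
    static final Set<String> PLUGIN_PROTOCOLS = Set.of("foo");
    private static final AtomicBoolean REGISTERED = new AtomicBoolean(false);

    static class Factory implements URLStreamHandlerFactory {
        @Override
        public URLStreamHandler createURLStreamHandler(String protocol) {
            if (!PLUGIN_PROTOCOLS.contains(protocol)) {
                // Returning null lets the JVM use its built-in handlers
                // (http, https, file, jar, ...).
                return null;
            }
            return new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(URL u) throws IOException {
                    throw new IOException("not implemented in this sketch: " + u);
                }
            };
        }
    }

    // The JVM accepts only one factory; a second call to
    // URL.setURLStreamHandlerFactory() throws an Error, so guard it.
    static boolean registerOnce() {
        if (!REGISTERED.compareAndSet(false, true)) {
            return false;
        }
        URL.setURLStreamHandlerFactory(new Factory());
        return true;
    }

    // Parses the given URL string and returns its protocol.
    static String protocolOf(String spec) {
        try {
            return URI.create(spec).toURL().getProtocol();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        registerOnce();
        System.out.println(protocolOf("foo://example.org/x"));
        System.out.println(protocolOf("http://example.org/"));
    }
}
```

Because the factory is global to the JVM, it is consulted for every protocol that has no cached handler yet, including lookups made by library code unrelated to the plugins.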

Notes:
- the job in which this error was observed is a [custom de-duplication job|https://github.com/commoncrawl/nutch/blob/cc/src/java/org/apache/nutch/crawl/DedupRedirectsJob.java] that flags redirects pointing to the same target URL. I'll try to reproduce the error with a standard Nutch job and in pseudo-distributed mode.
- we should also verify whether registering URL stream handlers works at all in distributed mode: tasks are launched differently, not as NutchJob or NutchTool.


