Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/07/04 19:14:21 UTC
[jira] [Created] (NUTCH-1029) Readdb throws EOFException
Readdb throws EOFException
--------------------------
Key: NUTCH-1029
URL: https://issues.apache.org/jira/browse/NUTCH-1029
Project: Nutch
Issue Type: Bug
Components: linkdb
Affects Versions: 1.4
Environment: Hadoop 0.20.203.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0
Running readdb -stats on a crawldb with 1 record exits with an EOFException on Hadoop 0.20.203.0.
{code}
Exception in thread "main" java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:93)
at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:320)
at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{code}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1029) Readdb throws EOFException
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1029:
---------------------------------
Attachment: NUTCH-1029-1.4-1.patch
The assumption was correct. Here's a patch for 1.4 that disables the creation of the _SUCCESS file for the stat job. I haven't tested topN and dump jobs.
[jira] [Updated] (NUTCH-1029) Readdb throws EOFException
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1029:
---------------------------------
Priority: Critical (was: Major)
Patch Info: [Patch Available]
[jira] [Updated] (NUTCH-1029) Readdb throws EOFException
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1029:
---------------------------------
Fix Version/s: (was: 2.0)
This was incorrectly marked as Fix For 2.0. I will commit it for 1.4 shortly.
[jira] [Issue Comment Edited] (NUTCH-1029) Readdb throws EOFException
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062630#comment-13062630 ]
Markus Jelsma edited comment on NUTCH-1029 at 7/9/11 9:22 PM:
--------------------------------------------------------------
The assumption was correct. Here's a patch for 1.4 that disables the creation of the _SUCCESS file for the stat job. I haven't tested topN and dump jobs.
By the way: a _SUCCESS file in the current crawldb also causes errors for the -url job. Yesterday I copied a crawldb over from production HDFS and had to remove that file as well before I could read it locally.
was (Author: markus17):
The assumption was correct. Here's a patch for 1.4 that disables the creation of the _SUCCESS file for the stat job. I haven't tested topN and dump jobs.
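As an illustration of the alternative to deleting the marker by hand, a reader could skip it when listing the data files. The sketch below is not the attached patch (which disables creation of the _SUCCESS file); it uses only the Java standard library on a local directory, and the class name PartFileLister and the method listDataFiles are hypothetical. It assumes the usual Hadoop convention that names starting with "_" or "." are not data files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartFileLister {

    /**
     * Hypothetical helper: returns the entries of dir in sorted order,
     * skipping names that start with '_' or '.' (for example _SUCCESS
     * or checksum files), on the assumption that those are not data files.
     */
    static List<Path> listDataFiles(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                String name = entry.getFileName().toString();
                if (!name.startsWith("_") && !name.startsWith(".")) {
                    result.add(entry);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory that mimics a job output directory.
        Path dir = Files.createTempDirectory("crawldb");
        Files.createFile(dir.resolve("_SUCCESS"));
        Files.createFile(dir.resolve("part-00000"));
        for (Path p : listDataFiles(dir)) {
            System.out.println(p.getFileName());
        }
    }
}
```

Only the part file is handed to whatever opens the readers, so the marker never reaches the SequenceFile code path.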
[jira] [Resolved] (NUTCH-1029) Readdb throws EOFException
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1029.
----------------------------------
Resolution: Fixed
Committed for 1.4 in rev. 1147615.
[jira] [Closed] (NUTCH-1029) Readdb throws EOFException
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-1029.
--------------------------------
[jira] [Commented] (NUTCH-1029) Readdb throws EOFException
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062489#comment-13062489 ]
Markus Jelsma commented on NUTCH-1029:
--------------------------------------
It seems this error is caused by the _SUCCESS file in the crawldb. Hadoop writes this file after successful job completion because of MAPREDUCE-947. The crawldb reader attempts to read the file as a SequenceFile, cannot, and throws the above exception.
The reader job writes and then reads a temporary stat_tmp1234567 directory. The subsequent read seems to choke on the _SUCCESS file.
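The failure mode can be reproduced without Hadoop. The sketch below assumes that the _SUCCESS marker is a zero-length file and that a SequenceFile reader begins by reading a short fixed-length header with DataInputStream.readFully, as the stack trace suggests. The 4-byte header length, the class name EmptyMarkerDemo and the method headerReadHitsEof are illustrative assumptions, not taken from the Nutch or Hadoop sources.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmptyMarkerDemo {

    // Assumption: the header is 4 bytes long ("SEQ" plus a version byte).
    static final int HEADER_LENGTH = 4;

    /** Returns true if reading the header from file ends in an EOFException. */
    static boolean headerReadHitsEof(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            // readFully throws EOFException when the stream ends too early.
            in.readFully(new byte[HEADER_LENGTH]);
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("stat_tmp");
        // Zero-length marker, standing in for _SUCCESS.
        Path marker = Files.createFile(dir.resolve("_SUCCESS"));
        // A file that at least contains a complete header.
        Path part = Files.write(dir.resolve("part-00000"), new byte[] {'S', 'E', 'Q', 6});

        System.out.println("_SUCCESS hits EOF: " + headerReadHitsEof(marker));
        System.out.println("part-00000 hits EOF: " + headerReadHitsEof(part));
    }
}
```

Under these assumptions the empty marker fails on the very first read, which matches the two readFully frames at the top of the reported trace.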
[jira] [Commented] (NUTCH-1029) Readdb throws EOFException
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063267#comment-13063267 ]
Markus Jelsma commented on NUTCH-1029:
--------------------------------------
Are there objections to this fix?