Posted to dev@nutch.apache.org by "Lucas Pauchard (Jira)" <ji...@apache.org> on 2019/12/05 12:44:00 UTC

[jira] [Updated] (NUTCH-2756) Segment Part problem with HDFS on distributed mode

     [ https://issues.apache.org/jira/browse/NUTCH-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lucas Pauchard updated NUTCH-2756:
----------------------------------
    Description: 
During parsing, it sometimes happens that part of the data on HDFS is missing afterwards.
When I take a look at our HDFS, I find this file with 0 bytes (see attachments).

After that, the CrawlDb update complains about this specific (possibly corrupted) part:
{panel:title=log_crawl}
2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id : attempt_1575479127636_0047_m_000017_2, Status : FAILED
Error: java.io.EOFException: hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 not a SequenceFile
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
        at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
{panel}
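The EOFException above is SequenceFile$Reader.init refusing the file: a valid SequenceFile is non-empty and starts with the 'SEQ' magic bytes. A minimal standalone check of the failing part could look like the sketch below (the class name is made up and the path is just the one from the log; this is not part of Nutch, only standard Hadoop FileSystem API):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical pre-check: is this part file a plausible SequenceFile? */
public class CheckSegmentPart {
  public static void main(String[] args) throws Exception {
    // Path taken from the failing task attempt in the log above.
    Path part = new Path(
        "hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004");
    FileSystem fs = part.getFileSystem(new Configuration());

    FileStatus status = fs.getFileStatus(part);
    if (status.getLen() == 0) {
      System.out.println("Zero-byte part file: " + part);
      return;
    }

    // Every SequenceFile starts with the three magic bytes 'S', 'E', 'Q'.
    byte[] magic = new byte[3];
    try (FSDataInputStream in = fs.open(part)) {
      in.readFully(0, magic);
    }
    boolean looksLikeSeq = magic[0] == 'S' && magic[1] == 'E' && magic[2] == 'Q';
    System.out.println(part + " length=" + status.getLen()
        + " sequenceFileHeader=" + looksLikeSeq);
  }
}
{code}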

When I check the namenode logs, I don't see any error while the segment part is being written, but about one hour later I get the following entries:
{panel:title=log_namenode}
2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 2], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index closed.
2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 1], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 closed.
{panel}
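For what it's worth, the one-hour gap matches HDFS's default hard lease limit of 60 minutes, so these messages look like the namenode expiring a lease that the reducer's DFS client never released, i.e. the output stream apparently was not closed cleanly. Right after the parse job finishes, the lease state of the part can be inspected with the HDFS client API; a small sketch (the class name is made up, the path is again the failing part from the log):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/** Hypothetical check: was the reducer output ever closed on the namenode side? */
public class CheckLease {
  public static void main(String[] args) throws Exception {
    Path part = new Path(
        "hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004");
    FileSystem fs = part.getFileSystem(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      boolean closed = dfs.isFileClosed(part);
      System.out.println(part + " closed=" + closed);
      if (!closed) {
        // Trigger lease recovery immediately instead of waiting for the
        // one-hour hard limit to expire on the namenode.
        System.out.println("recoverLease returned " + dfs.recoverLease(part));
      }
    }
  }
}
{code}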

This issue is hard to reproduce and I can't figure out what the preconditions are; it seems to happen randomly.
Maybe the problem comes from improper handling when the output file is closed.
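If it is indeed a close problem, the fix would be to make sure the writer is always flushed and closed, even on the error path. Just to illustrate the pattern I mean (this is not how Nutch's output format is actually structured, only a generic sketch with made-up key/value types and path):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Illustration only: write a SequenceFile so it can never be left half-open. */
public class SafeWrite {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path out = new Path("/tmp/example-part");  // made-up path

    // try-with-resources guarantees close() even if append() throws, so the
    // DFS client releases its lease instead of leaving a 0-byte open file.
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      writer.append(new Text("key"), new Text("value"));
      writer.hsync(); // force the data to the datanodes before closing
    }
  }
}
{code}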


> Segment Part problem with HDFS on distributed mode
> --------------------------------------------------
>
>                 Key: NUTCH-2756
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2756
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Lucas Pauchard
>            Priority: Major
>         Attachments: 0_byte_file_screenshot.PNG
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)