You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Mitesh Singh Jat (JIRA)" <ji...@apache.org> on 2013/09/18 17:52:00 UTC

[jira] [Comment Edited] (NUTCH-1640) OOM in ParseSegment Phase

    [ https://issues.apache.org/jira/browse/NUTCH-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770902#comment-13770902 ] 

Mitesh Singh Jat edited comment on NUTCH-1640 at 9/18/13 3:51 PM:
------------------------------------------------------------------

When a test run done with 65312 records, ParseSegment MR Job status

{noformat}
13/09/18 21:02:13 INFO mapred.JobClient: Job complete: job_201308301130_0517
13/09/18 21:02:13 INFO mapred.JobClient: Counters: 32
13/09/18 21:02:13 INFO mapred.JobClient:   ParserStatus
13/09/18 21:02:13 INFO mapred.JobClient:     failed=96
13/09/18 21:02:13 INFO mapred.JobClient:     success=52393
13/09/18 21:02:13 INFO mapred.JobClient:   Job Counters 
13/09/18 21:02:13 INFO mapred.JobClient:     Launched reduce tasks=1
13/09/18 21:02:13 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=402103
13/09/18 21:02:13 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/09/18 21:02:13 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/09/18 21:02:13 INFO mapred.JobClient:     Launched map tasks=3
13/09/18 21:02:13 INFO mapred.JobClient:     Data-local map tasks=3
13/09/18 21:02:13 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=125910
13/09/18 21:02:13 INFO mapred.JobClient:   File Input Format Counters 
13/09/18 21:02:13 INFO mapred.JobClient:     Bytes Read=169842629
13/09/18 21:02:13 INFO mapred.JobClient:   File Output Format Counters 
13/09/18 21:02:13 INFO mapred.JobClient:     Bytes Written=0
13/09/18 21:02:13 INFO mapred.JobClient:   FileSystemCounters
13/09/18 21:02:13 INFO mapred.JobClient:     FILE_BYTES_READ=79839439
13/09/18 21:02:13 INFO mapred.JobClient:     HDFS_BYTES_READ=169847735
13/09/18 21:02:13 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=137342604
13/09/18 21:02:13 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=67039214
13/09/18 21:02:13 INFO mapred.JobClient:   Map-Reduce Framework
13/09/18 21:02:13 INFO mapred.JobClient:     Map output materialized bytes=57153030
13/09/18 21:02:13 INFO mapred.JobClient:     Map input records=65312
13/09/18 21:02:13 INFO mapred.JobClient:     Reduce shuffle bytes=57153030
13/09/18 21:02:13 INFO mapred.JobClient:     Spilled Records=125780
13/09/18 21:02:13 INFO mapred.JobClient:     Map output bytes=202984117
13/09/18 21:02:13 INFO mapred.JobClient:     Total committed heap usage (bytes)=682557440
13/09/18 21:02:13 INFO mapred.JobClient:     CPU time spent (ms)=398190
13/09/18 21:02:13 INFO mapred.JobClient:     Map input bytes=169837445
13/09/18 21:02:13 INFO mapred.JobClient:     SPLIT_RAW_BYTES=435
13/09/18 21:02:13 INFO mapred.JobClient:     Combine input records=0
13/09/18 21:02:13 INFO mapred.JobClient:     Reduce input records=52489
13/09/18 21:02:13 INFO mapred.JobClient:     Reduce input groups=52489
13/09/18 21:02:13 INFO mapred.JobClient:     Combine output records=0
13/09/18 21:02:13 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1939349504
13/09/18 21:02:13 INFO mapred.JobClient:     Reduce output records=52489
13/09/18 21:02:13 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=32182784000
13/09/18 21:02:13 INFO mapred.JobClient:     Map output records=52489
13/09/18 21:02:13 INFO parse.ParseSegment: ParseSegment: finished at 2013-09-18 21:02:13, elapsed: 00:05:04
{noformat}

jstat gccause results
{noformat}
Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC                 
...
           10.1  50.30   2.15 100.00  75.54  96.81      7    0.079     0    0.000    0.079 unknown GCCause      Allocation Failure  
           32.3   0.00  12.16  93.16  80.23  99.30     32    0.468     0    0.000    0.468 unknown GCCause      Allocation Failure  
           38.1  34.72   0.00  98.54  81.65  99.99     43    0.709     0    0.000    0.709 unknown GCCause      Allocation Failure  
           40.7  33.33   0.00 100.00  82.19  99.35     46    0.813     0    0.000    0.813 unknown GCCause      Allocation Failure  
           43.9   0.00  37.50  98.02  82.88  99.48     52    0.956     0    0.000    0.956 unknown GCCause      Allocation Failure  
           45.7  35.00  45.45 100.00  83.28  99.57     54    1.014     0    0.000    1.014 unknown GCCause      Allocation Failure  
           49.2   0.00  58.34  97.81  83.67  99.64     58    1.167     0    0.000    1.167 unknown GCCause      Allocation Failure  
           53.2  75.00   0.00  99.88  84.31  99.75     62    1.349     0    0.000    1.349 unknown GCCause      Allocation Failure  
           55.2   0.00  96.49  97.67  84.71  99.98     66    1.460     0    0.000    1.460 unknown GCCause      Allocation Failure  
           57.7  69.45  38.89 100.00  85.26  99.35     69    1.569     0    0.000    1.569 unknown GCCause      Allocation Failure  
           59.7  44.44  69.45 100.00  85.68  99.40     72    1.685     0    0.000    1.685 unknown GCCause      Allocation Failure  
           61.4   0.00  91.67  96.20  85.81  99.44     74    1.773     0    0.000    1.773 unknown GCCause      Allocation Failure  
           63.7   0.00  68.75  98.69  86.08  99.47     76    1.861     0    0.000    1.861 unknown GCCause      Allocation Failure  
           66.5   0.00  56.25 100.00  86.65  99.53     80    2.028     0    0.000    2.028 unknown GCCause      Allocation Failure  
           69.0  43.75  62.50 100.00  87.04  99.78     82    2.125     0    0.000    2.125 unknown GCCause      Allocation Failure  
           71.0  78.13   0.00  99.98  87.34  99.81     85    2.271     0    0.000    2.271 unknown GCCause      Allocation Failure  
           74.7   0.00  64.29 100.00  88.05  99.87     90    2.548     0    0.000    2.548 unknown GCCause      Allocation Failure  
           76.2   0.00  57.14  99.99  88.33  99.90     92    2.652     0    0.000    2.652 unknown GCCause      Allocation Failure  
           77.5  57.14  75.00 100.00  88.73  99.93     94    2.744     0    0.000    2.744 unknown GCCause      Allocation Failure  
           89.4  42.86  71.43 100.00  90.75  99.41    110    3.681     0    0.000    3.681 unknown GCCause      Allocation Failure  
           90.2   0.00  71.43   0.00  91.05  99.42    111    3.768     1    0.000    3.768 Allocation Failure   unknown GCCause     
           90.4   0.00  71.43   0.00  91.05  99.42    111    3.768     1    0.000    3.768 Allocation Failure   unknown GCCause     
           91.2   0.00   0.00 100.00  91.23  49.74    111    3.768     2    0.492    4.261 Allocation Failure   unknown GCCause     
           91.9   0.00   0.00 100.00  91.46  44.44    111    3.768     3    0.856    4.624 Allocation Failure   unknown GCCause     
           92.2   0.00   0.00 100.00  91.46  44.44    111    3.768     3    0.856    4.624 Allocation Failure   unknown GCCause     
           93.0   0.00   0.00 100.00  91.60  45.88    111    3.768     4    1.181    4.949 Allocation Failure   unknown GCCause     
           93.2   0.00   0.00 100.00  91.60  45.27    111    3.768     4    1.181    4.949 Allocation Failure   unknown GCCause     
           94.0  50.00   0.00 100.00  87.67  47.83    112    3.768     4    1.571    5.340 unknown GCCause      Allocation Failure  
           95.2   0.00 100.00  87.74  87.73  47.86    114    3.907     4    1.571    5.478 unknown GCCause      Allocation Failure  
           96.0  75.00   0.00  99.23  87.96  47.89    115    3.961     4    1.571    5.533 unknown GCCause      Allocation Failure  
           98.3  97.50   0.00 100.00  88.26  47.93    119    4.191     4    1.571    5.763 unknown GCCause      Allocation Failure  
           98.8  77.27  72.73 100.00  88.52  47.94    120    4.246     4    1.571    5.817 unknown GCCause      Allocation Failure  
          102.2 100.00   0.00  99.96  88.78  47.97    125    4.572     4    1.571    6.144 unknown GCCause      Allocation Failure  
          102.8   0.00  98.53  96.96  88.87  47.98    126    4.631     4    1.571    6.202 unknown GCCause      Allocation Failure  
          103.3  92.65   0.00  96.77  89.11  47.98    127    4.687     4    1.571    6.259 unknown GCCause      Allocation Failure  
          105.0  92.59  98.91 100.00  89.11  48.01    130    4.852     4    1.571    6.424 unknown GCCause      Allocation Failure  
          106.0  92.42  98.40 100.00  89.11  48.02    132    4.968     4    1.571    6.540 unknown GCCause      Allocation Failure  
          107.0  88.42  97.86 100.00  89.11  48.03    134    5.073     4    1.571    6.644 unknown GCCause      Allocation Failure  
          107.7  89.63  64.89 100.00  89.11  48.03    135    5.148     4    1.571    6.719 unknown GCCause      Allocation Failure  
          115.5   0.00  62.10  95.20  90.55  48.09    148    6.034     4    1.571    7.605 unknown GCCause      Allocation Failure  
          116.0  64.41   0.00  94.67  90.68  48.09    149    6.093     4    1.571    7.665 unknown GCCause      Allocation Failure  
          116.8  60.34  62.50 100.00  90.92  48.10    150    6.145     4    1.571    7.717 unknown GCCause      Allocation Failure  
          117.0  66.67   0.00   0.00  90.93  48.10    150    6.268     5    1.571    7.839 Allocation Failure   unknown GCCause     
          117.3  66.67   0.00   0.00  90.93  48.10    150    6.268     5    1.571    7.839 Allocation Failure   unknown GCCause     
          117.5  66.67   0.00   0.00  90.93  48.10    150    6.268     5    1.571    7.839 Allocation Failure   unknown GCCause     
          118.8   0.00   0.00 100.00  92.68  49.94    150    6.268     6    2.293    8.560 Allocation Failure   unknown GCCause     
          119.6   0.00   0.00 100.00  92.76  52.56    150    6.268     7    2.673    8.941 Allocation Failure   unknown GCCause     
          119.8   0.00   0.00 100.00  92.76  52.56    150    6.268     7    2.673    8.941 Allocation Failure   unknown GCCause     
          120.3   0.00   0.00  99.90  92.70  55.38    150    6.268     8    3.015    9.282 Allocation Failure   unknown GCCause     
          120.6   0.00   0.00 100.00  92.70  55.38    150    6.268     8    3.015    9.282 Allocation Failure   unknown GCCause     
          122.3  12.08  14.75 100.00  87.68  59.13    153    6.361     8    3.366    9.727 unknown GCCause      Allocation Failure  
          122.8  17.74  95.00 100.00  87.68  59.13    154    6.422     8    3.366    9.788 unknown GCCause      Allocation Failure  
          123.3  18.55   0.00  91.70  87.68  59.14    155    6.480     8    3.366    9.846 unknown GCCause      Allocation Failure  
          127.2  52.50   0.00  96.64  87.68  59.20    161    6.841     8    3.366   10.208 unknown GCCause      Allocation Failure  
          133.6  72.12  69.81 100.00  87.93  59.27    167    7.231     8    3.366   10.597 unknown GCCause      Allocation Failure  
          136.0  67.86  66.07 100.00  88.42  59.31    171    7.469     8    3.366   10.836 unknown GCCause      Allocation Failure  
          137.7  65.45  69.20 100.00  88.64  59.33    174    7.662     8    3.366   11.028 unknown GCCause      Allocation Failure  
          138.2  70.91   0.00 100.00  88.77  59.33    175    7.731     8    3.366   11.097 unknown GCCause      Allocation Failure  
          138.7   0.00  69.09 100.00  88.89  59.34    176    7.792     8    3.366   11.158 unknown GCCause      Allocation Failure  
          140.5   4.72  71.30 100.00  89.12  59.35    178    7.910     8    3.366   11.276 unknown GCCause      Allocation Failure  
          141.7  62.50  72.64 100.00  89.36  59.35    180    8.070     8    3.366   11.437 unknown GCCause      Allocation Failure  
          144.5  79.59  76.00 100.00  90.09  59.37    185    8.398     8    3.366   11.764 unknown GCCause      Allocation Failure  
          145.0   6.00  80.00 100.00  90.09  59.38    186    8.510     8    3.366   11.876 unknown GCCause      Allocation Failure  
          145.5  83.85   0.00 100.00  90.21  59.39    187    8.571     8    3.366   11.938 unknown GCCause      Allocation Failure  
          146.0   0.00  83.67  96.51  90.33  59.39    188    8.642     8    3.366   12.008 unknown GCCause      Allocation Failure  
          148.5  85.64   0.00   0.00  90.93  59.40    192    8.976     9    3.366   12.342 Allocation Failure   unknown GCCause     
          149.5   0.00   0.00 100.00  92.77  62.11    192    8.976    10    3.729   12.704 Allocation Failure   unknown GCCause     
          150.5   0.00   0.00 100.00  92.90  65.32    192    8.976    11    4.101   13.076 Allocation Failure   unknown GCCause     
          150.7   0.00   0.00 100.00  92.90  65.32    192    8.976    11    4.101   13.076 Allocation Failure   unknown GCCause     
          151.2   0.00   0.00 100.00  92.89  68.47    192    8.976    12    4.438   13.414 Allocation Failure   unknown GCCause     
          151.5   0.00   0.00 100.00  92.89  68.47    192    8.976    12    4.438   13.414 Allocation Failure   unknown GCCause     
          152.0   0.00   5.21 100.00  87.95  71.52    193    8.976    12    4.756   13.732 unknown GCCause      Allocation Failure  
          154.2  21.55   0.00  93.09  87.95  71.66    197    9.204    12    4.756   13.960 unknown GCCause      Allocation Failure
...
          165.8   0.00  22.03  23.28  87.95  72.33    197    9.265    12    4.756   14.021 unknown GCCause      No GC               
          166.0   0.00  22.03  35.03  87.95  72.54    197    9.265    12    4.756   14.021 unknown GCCause      No GC
{noformat}


When the same set of records given to the patched ParseSegment
{noformat}
13/09/18 21:16:27 INFO mapred.JobClient: Job complete: job_201308301130_0519
13/09/18 21:16:27 INFO mapred.JobClient: Counters: 32
13/09/18 21:16:27 INFO mapred.JobClient:   ParserStatus
13/09/18 21:16:27 INFO mapred.JobClient:     failed=96
13/09/18 21:16:27 INFO mapred.JobClient:     success=52393
13/09/18 21:16:27 INFO mapred.JobClient:   Job Counters 
13/09/18 21:16:27 INFO mapred.JobClient:     Launched reduce tasks=1
13/09/18 21:16:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=228650
13/09/18 21:16:27 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/09/18 21:16:27 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/09/18 21:16:27 INFO mapred.JobClient:     Launched map tasks=3
13/09/18 21:16:27 INFO mapred.JobClient:     Data-local map tasks=3
13/09/18 21:16:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=94147
13/09/18 21:16:27 INFO mapred.JobClient:   File Input Format Counters 
13/09/18 21:16:27 INFO mapred.JobClient:     Bytes Read=169842629
13/09/18 21:16:27 INFO mapred.JobClient:   File Output Format Counters 
13/09/18 21:16:27 INFO mapred.JobClient:     Bytes Written=0
13/09/18 21:16:27 INFO mapred.JobClient:   FileSystemCounters
13/09/18 21:16:27 INFO mapred.JobClient:     FILE_BYTES_READ=79839439
13/09/18 21:16:27 INFO mapred.JobClient:     HDFS_BYTES_READ=169847735
13/09/18 21:16:27 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=137342636
13/09/18 21:16:27 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=67039060
13/09/18 21:16:27 INFO mapred.JobClient:   Map-Reduce Framework
13/09/18 21:16:27 INFO mapred.JobClient:     Map output materialized bytes=57153030
13/09/18 21:16:27 INFO mapred.JobClient:     Map input records=65312
13/09/18 21:16:27 INFO mapred.JobClient:     Reduce shuffle bytes=57153030
13/09/18 21:16:27 INFO mapred.JobClient:     Spilled Records=125780
13/09/18 21:16:27 INFO mapred.JobClient:     Map output bytes=202984117
13/09/18 21:16:27 INFO mapred.JobClient:     Total committed heap usage (bytes)=679673856
13/09/18 21:16:27 INFO mapred.JobClient:     CPU time spent (ms)=271210
13/09/18 21:16:27 INFO mapred.JobClient:     Map input bytes=169837445
13/09/18 21:16:27 INFO mapred.JobClient:     SPLIT_RAW_BYTES=435
13/09/18 21:16:27 INFO mapred.JobClient:     Combine input records=0
13/09/18 21:16:27 INFO mapred.JobClient:     Reduce input records=52489
13/09/18 21:16:27 INFO mapred.JobClient:     Reduce input groups=52489
13/09/18 21:16:27 INFO mapred.JobClient:     Combine output records=0
13/09/18 21:16:27 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1009508352
13/09/18 21:16:27 INFO mapred.JobClient:     Reduce output records=52489
13/09/18 21:16:27 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4003405824
13/09/18 21:16:27 INFO mapred.JobClient:     Map output records=52489
13/09/18 21:16:27 INFO parse.ParseSegment: ParseSegment: finished at 2013-09-18 21:16:27, elapsed: 00:03:20
{noformat}

jstat gccause (only one line without "No GC")
{noformat}
Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC 
...                
           29.1   0.00  75.00  99.16  77.81  99.76     20    0.197     0    0.000    0.197 unknown GCCause      Allocation Failure 
...
           94.4   0.00  50.00  94.38  79.28  99.63    199    1.065     0    0.000    1.065 unknown GCCause      No GC               
           94.4   0.00  50.00  94.38  79.28  99.68    199    1.065     0    0.000    1.065 unknown GCCause      No GC
{noformat}


                
      was (Author: miteshsjat):
    When a test run done with 65312 records, ParseSegment MR Job status

{noformat}
13/09/18 21:02:13 INFO mapred.JobClient: Job complete: job_201308301130_0517
13/09/18 21:02:13 INFO mapred.JobClient: Counters: 32
13/09/18 21:02:13 INFO mapred.JobClient:   ParserStatus
13/09/18 21:02:13 INFO mapred.JobClient:     failed=96
13/09/18 21:02:13 INFO mapred.JobClient:     success=52393
13/09/18 21:02:13 INFO mapred.JobClient:   Job Counters 
13/09/18 21:02:13 INFO mapred.JobClient:     Launched reduce tasks=1
13/09/18 21:02:13 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=402103
13/09/18 21:02:13 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/09/18 21:02:13 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/09/18 21:02:13 INFO mapred.JobClient:     Launched map tasks=3
13/09/18 21:02:13 INFO mapred.JobClient:     Data-local map tasks=3
13/09/18 21:02:13 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=125910
13/09/18 21:02:13 INFO mapred.JobClient:   File Input Format Counters 
13/09/18 21:02:13 INFO mapred.JobClient:     Bytes Read=169842629
13/09/18 21:02:13 INFO mapred.JobClient:   File Output Format Counters 
13/09/18 21:02:13 INFO mapred.JobClient:     Bytes Written=0
13/09/18 21:02:13 INFO mapred.JobClient:   FileSystemCounters
13/09/18 21:02:13 INFO mapred.JobClient:     FILE_BYTES_READ=79839439
13/09/18 21:02:13 INFO mapred.JobClient:     HDFS_BYTES_READ=169847735
13/09/18 21:02:13 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=137342604
13/09/18 21:02:13 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=67039214
13/09/18 21:02:13 INFO mapred.JobClient:   Map-Reduce Framework
13/09/18 21:02:13 INFO mapred.JobClient:     Map output materialized bytes=57153030
13/09/18 21:02:13 INFO mapred.JobClient:     Map input records=65312
13/09/18 21:02:13 INFO mapred.JobClient:     Reduce shuffle bytes=57153030
13/09/18 21:02:13 INFO mapred.JobClient:     Spilled Records=125780
13/09/18 21:02:13 INFO mapred.JobClient:     Map output bytes=202984117
13/09/18 21:02:13 INFO mapred.JobClient:     Total committed heap usage (bytes)=682557440
13/09/18 21:02:13 INFO mapred.JobClient:     CPU time spent (ms)=398190
13/09/18 21:02:13 INFO mapred.JobClient:     Map input bytes=169837445
13/09/18 21:02:13 INFO mapred.JobClient:     SPLIT_RAW_BYTES=435
13/09/18 21:02:13 INFO mapred.JobClient:     Combine input records=0
13/09/18 21:02:13 INFO mapred.JobClient:     Reduce input records=52489
13/09/18 21:02:13 INFO mapred.JobClient:     Reduce input groups=52489
13/09/18 21:02:13 INFO mapred.JobClient:     Combine output records=0
13/09/18 21:02:13 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1939349504
13/09/18 21:02:13 INFO mapred.JobClient:     Reduce output records=52489
13/09/18 21:02:13 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=32182784000
13/09/18 21:02:13 INFO mapred.JobClient:     Map output records=52489
13/09/18 21:02:13 INFO parse.ParseSegment: ParseSegment: finished at 2013-09-18 21:02:13, elapsed: 00:05:04
{noformat}

jstat gccause results
{noformat}
Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC                 
...
           10.1  50.30   2.15 100.00  75.54  96.81      7    0.079     0    0.000    0.079 unknown GCCause      Allocation Failure  
           32.3   0.00  12.16  93.16  80.23  99.30     32    0.468     0    0.000    0.468 unknown GCCause      Allocation Failure  
           38.1  34.72   0.00  98.54  81.65  99.99     43    0.709     0    0.000    0.709 unknown GCCause      Allocation Failure  
           40.7  33.33   0.00 100.00  82.19  99.35     46    0.813     0    0.000    0.813 unknown GCCause      Allocation Failure  
           43.9   0.00  37.50  98.02  82.88  99.48     52    0.956     0    0.000    0.956 unknown GCCause      Allocation Failure  
           45.7  35.00  45.45 100.00  83.28  99.57     54    1.014     0    0.000    1.014 unknown GCCause      Allocation Failure  
           49.2   0.00  58.34  97.81  83.67  99.64     58    1.167     0    0.000    1.167 unknown GCCause      Allocation Failure  
           53.2  75.00   0.00  99.88  84.31  99.75     62    1.349     0    0.000    1.349 unknown GCCause      Allocation Failure  
           55.2   0.00  96.49  97.67  84.71  99.98     66    1.460     0    0.000    1.460 unknown GCCause      Allocation Failure  
           57.7  69.45  38.89 100.00  85.26  99.35     69    1.569     0    0.000    1.569 unknown GCCause      Allocation Failure  
           59.7  44.44  69.45 100.00  85.68  99.40     72    1.685     0    0.000    1.685 unknown GCCause      Allocation Failure  
           61.4   0.00  91.67  96.20  85.81  99.44     74    1.773     0    0.000    1.773 unknown GCCause      Allocation Failure  
           63.7   0.00  68.75  98.69  86.08  99.47     76    1.861     0    0.000    1.861 unknown GCCause      Allocation Failure  
           66.5   0.00  56.25 100.00  86.65  99.53     80    2.028     0    0.000    2.028 unknown GCCause      Allocation Failure  
           69.0  43.75  62.50 100.00  87.04  99.78     82    2.125     0    0.000    2.125 unknown GCCause      Allocation Failure  
           71.0  78.13   0.00  99.98  87.34  99.81     85    2.271     0    0.000    2.271 unknown GCCause      Allocation Failure  
           74.7   0.00  64.29 100.00  88.05  99.87     90    2.548     0    0.000    2.548 unknown GCCause      Allocation Failure  
           76.2   0.00  57.14  99.99  88.33  99.90     92    2.652     0    0.000    2.652 unknown GCCause      Allocation Failure  
           77.5  57.14  75.00 100.00  88.73  99.93     94    2.744     0    0.000    2.744 unknown GCCause      Allocation Failure  
           89.4  42.86  71.43 100.00  90.75  99.41    110    3.681     0    0.000    3.681 unknown GCCause      Allocation Failure  
           90.2   0.00  71.43   0.00  91.05  99.42    111    3.768     1    0.000    3.768 Allocation Failure   unknown GCCause     
           90.4   0.00  71.43   0.00  91.05  99.42    111    3.768     1    0.000    3.768 Allocation Failure   unknown GCCause     
           91.2   0.00   0.00 100.00  91.23  49.74    111    3.768     2    0.492    4.261 Allocation Failure   unknown GCCause     
           91.9   0.00   0.00 100.00  91.46  44.44    111    3.768     3    0.856    4.624 Allocation Failure   unknown GCCause     
           92.2   0.00   0.00 100.00  91.46  44.44    111    3.768     3    0.856    4.624 Allocation Failure   unknown GCCause     
           93.0   0.00   0.00 100.00  91.60  45.88    111    3.768     4    1.181    4.949 Allocation Failure   unknown GCCause     
           93.2   0.00   0.00 100.00  91.60  45.27    111    3.768     4    1.181    4.949 Allocation Failure   unknown GCCause     
           94.0  50.00   0.00 100.00  87.67  47.83    112    3.768     4    1.571    5.340 unknown GCCause      Allocation Failure  
           95.2   0.00 100.00  87.74  87.73  47.86    114    3.907     4    1.571    5.478 unknown GCCause      Allocation Failure  
           96.0  75.00   0.00  99.23  87.96  47.89    115    3.961     4    1.571    5.533 unknown GCCause      Allocation Failure  
           98.3  97.50   0.00 100.00  88.26  47.93    119    4.191     4    1.571    5.763 unknown GCCause      Allocation Failure  
           98.8  77.27  72.73 100.00  88.52  47.94    120    4.246     4    1.571    5.817 unknown GCCause      Allocation Failure  
          102.2 100.00   0.00  99.96  88.78  47.97    125    4.572     4    1.571    6.144 unknown GCCause      Allocation Failure  
          102.8   0.00  98.53  96.96  88.87  47.98    126    4.631     4    1.571    6.202 unknown GCCause      Allocation Failure  
          103.3  92.65   0.00  96.77  89.11  47.98    127    4.687     4    1.571    6.259 unknown GCCause      Allocation Failure  
          105.0  92.59  98.91 100.00  89.11  48.01    130    4.852     4    1.571    6.424 unknown GCCause      Allocation Failure  
          106.0  92.42  98.40 100.00  89.11  48.02    132    4.968     4    1.571    6.540 unknown GCCause      Allocation Failure  
          107.0  88.42  97.86 100.00  89.11  48.03    134    5.073     4    1.571    6.644 unknown GCCause      Allocation Failure  
          107.7  89.63  64.89 100.00  89.11  48.03    135    5.148     4    1.571    6.719 unknown GCCause      Allocation Failure  
          115.5   0.00  62.10  95.20  90.55  48.09    148    6.034     4    1.571    7.605 unknown GCCause      Allocation Failure  
          116.0  64.41   0.00  94.67  90.68  48.09    149    6.093     4    1.571    7.665 unknown GCCause      Allocation Failure  
          116.8  60.34  62.50 100.00  90.92  48.10    150    6.145     4    1.571    7.717 unknown GCCause      Allocation Failure  
          117.0  66.67   0.00   0.00  90.93  48.10    150    6.268     5    1.571    7.839 Allocation Failure   unknown GCCause     
          117.3  66.67   0.00   0.00  90.93  48.10    150    6.268     5    1.571    7.839 Allocation Failure   unknown GCCause     
          117.5  66.67   0.00   0.00  90.93  48.10    150    6.268     5    1.571    7.839 Allocation Failure   unknown GCCause     
          118.8   0.00   0.00 100.00  92.68  49.94    150    6.268     6    2.293    8.560 Allocation Failure   unknown GCCause     
          119.6   0.00   0.00 100.00  92.76  52.56    150    6.268     7    2.673    8.941 Allocation Failure   unknown GCCause     
          119.8   0.00   0.00 100.00  92.76  52.56    150    6.268     7    2.673    8.941 Allocation Failure   unknown GCCause     
          120.3   0.00   0.00  99.90  92.70  55.38    150    6.268     8    3.015    9.282 Allocation Failure   unknown GCCause     
          120.6   0.00   0.00 100.00  92.70  55.38    150    6.268     8    3.015    9.282 Allocation Failure   unknown GCCause     
          122.3  12.08  14.75 100.00  87.68  59.13    153    6.361     8    3.366    9.727 unknown GCCause      Allocation Failure  
          122.8  17.74  95.00 100.00  87.68  59.13    154    6.422     8    3.366    9.788 unknown GCCause      Allocation Failure  
          123.3  18.55   0.00  91.70  87.68  59.14    155    6.480     8    3.366    9.846 unknown GCCause      Allocation Failure  
          127.2  52.50   0.00  96.64  87.68  59.20    161    6.841     8    3.366   10.208 unknown GCCause      Allocation Failure  
          133.6  72.12  69.81 100.00  87.93  59.27    167    7.231     8    3.366   10.597 unknown GCCause      Allocation Failure  
          136.0  67.86  66.07 100.00  88.42  59.31    171    7.469     8    3.366   10.836 unknown GCCause      Allocation Failure  
          137.7  65.45  69.20 100.00  88.64  59.33    174    7.662     8    3.366   11.028 unknown GCCause      Allocation Failure  
          138.2  70.91   0.00 100.00  88.77  59.33    175    7.731     8    3.366   11.097 unknown GCCause      Allocation Failure  
          138.7   0.00  69.09 100.00  88.89  59.34    176    7.792     8    3.366   11.158 unknown GCCause      Allocation Failure  
          140.5   4.72  71.30 100.00  89.12  59.35    178    7.910     8    3.366   11.276 unknown GCCause      Allocation Failure  
          141.7  62.50  72.64 100.00  89.36  59.35    180    8.070     8    3.366   11.437 unknown GCCause      Allocation Failure  
          144.5  79.59  76.00 100.00  90.09  59.37    185    8.398     8    3.366   11.764 unknown GCCause      Allocation Failure  
          145.0   6.00  80.00 100.00  90.09  59.38    186    8.510     8    3.366   11.876 unknown GCCause      Allocation Failure  
          145.5  83.85   0.00 100.00  90.21  59.39    187    8.571     8    3.366   11.938 unknown GCCause      Allocation Failure  
          146.0   0.00  83.67  96.51  90.33  59.39    188    8.642     8    3.366   12.008 unknown GCCause      Allocation Failure  
          148.5  85.64   0.00   0.00  90.93  59.40    192    8.976     9    3.366   12.342 Allocation Failure   unknown GCCause     
          149.5   0.00   0.00 100.00  92.77  62.11    192    8.976    10    3.729   12.704 Allocation Failure   unknown GCCause     
          150.5   0.00   0.00 100.00  92.90  65.32    192    8.976    11    4.101   13.076 Allocation Failure   unknown GCCause     
          150.7   0.00   0.00 100.00  92.90  65.32    192    8.976    11    4.101   13.076 Allocation Failure   unknown GCCause     
          151.2   0.00   0.00 100.00  92.89  68.47    192    8.976    12    4.438   13.414 Allocation Failure   unknown GCCause     
          151.5   0.00   0.00 100.00  92.89  68.47    192    8.976    12    4.438   13.414 Allocation Failure   unknown GCCause     
          152.0   0.00   5.21 100.00  87.95  71.52    193    8.976    12    4.756   13.732 unknown GCCause      Allocation Failure  
          154.2  21.55   0.00  93.09  87.95  71.66    197    9.204    12    4.756   13.960 unknown GCCause      Allocation Failure
...
          165.8   0.00  22.03  23.28  87.95  72.33    197    9.265    12    4.756   14.021 unknown GCCause      No GC               
          166.0   0.00  22.03  35.03  87.95  72.54    197    9.265    12    4.756   14.021 unknown GCCause      No GC
{noformat}


When the same set of records given to the patched ParseSegment
{noformat}
13/09/18 21:16:27 INFO mapred.JobClient: Job complete: job_201308301130_0519
13/09/18 21:16:27 INFO mapred.JobClient: Counters: 32
13/09/18 21:16:27 INFO mapred.JobClient:   ParserStatus
13/09/18 21:16:27 INFO mapred.JobClient:     failed=96
13/09/18 21:16:27 INFO mapred.JobClient:     success=52393
13/09/18 21:16:27 INFO mapred.JobClient:   Job Counters 
13/09/18 21:16:27 INFO mapred.JobClient:     Launched reduce tasks=1
13/09/18 21:16:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=228650
13/09/18 21:16:27 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/09/18 21:16:27 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/09/18 21:16:27 INFO mapred.JobClient:     Launched map tasks=3
13/09/18 21:16:27 INFO mapred.JobClient:     Data-local map tasks=3
13/09/18 21:16:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=94147
13/09/18 21:16:27 INFO mapred.JobClient:   File Input Format Counters 
13/09/18 21:16:27 INFO mapred.JobClient:     Bytes Read=169842629
13/09/18 21:16:27 INFO mapred.JobClient:   File Output Format Counters 
13/09/18 21:16:27 INFO mapred.JobClient:     Bytes Written=0
13/09/18 21:16:27 INFO mapred.JobClient:   FileSystemCounters
13/09/18 21:16:27 INFO mapred.JobClient:     FILE_BYTES_READ=79839439
13/09/18 21:16:27 INFO mapred.JobClient:     HDFS_BYTES_READ=169847735
13/09/18 21:16:27 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=137342636
13/09/18 21:16:27 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=67039060
13/09/18 21:16:27 INFO mapred.JobClient:   Map-Reduce Framework
13/09/18 21:16:27 INFO mapred.JobClient:     Map output materialized bytes=57153030
13/09/18 21:16:27 INFO mapred.JobClient:     Map input records=65312
13/09/18 21:16:27 INFO mapred.JobClient:     Reduce shuffle bytes=57153030
13/09/18 21:16:27 INFO mapred.JobClient:     Spilled Records=125780
13/09/18 21:16:27 INFO mapred.JobClient:     Map output bytes=202984117
13/09/18 21:16:27 INFO mapred.JobClient:     Total committed heap usage (bytes)=679673856
13/09/18 21:16:27 INFO mapred.JobClient:     CPU time spent (ms)=271210
13/09/18 21:16:27 INFO mapred.JobClient:     Map input bytes=169837445
13/09/18 21:16:27 INFO mapred.JobClient:     SPLIT_RAW_BYTES=435
13/09/18 21:16:27 INFO mapred.JobClient:     Combine input records=0
13/09/18 21:16:27 INFO mapred.JobClient:     Reduce input records=52489
13/09/18 21:16:27 INFO mapred.JobClient:     Reduce input groups=52489
13/09/18 21:16:27 INFO mapred.JobClient:     Combine output records=0
13/09/18 21:16:27 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1009508352
13/09/18 21:16:27 INFO mapred.JobClient:     Reduce output records=52489
13/09/18 21:16:27 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4003405824
13/09/18 21:16:27 INFO mapred.JobClient:     Map output records=52489
13/09/18 21:16:27 INFO crawler.NISParseSegment: ParseSegment: finished at 2013-09-18 21:16:27, elapsed: 00:03:20
{noformat}

jstat gccause (only one line without "No GC")
{noformat}
Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC 
...                
           29.1   0.00  75.00  99.16  77.81  99.76     20    0.197     0    0.000    0.197 unknown GCCause      Allocation Failure 
...
           94.4   0.00  50.00  94.38  79.28  99.63    199    1.065     0    0.000    1.065 unknown GCCause      No GC               
           94.4   0.00  50.00  94.38  79.28  99.68    199    1.065     0    0.000    1.065 unknown GCCause      No GC
{noformat}


                  
> OOM in ParseSegment Phase
> -------------------------
>
>                 Key: NUTCH-1640
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1640
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>         Environment: RHEL 6.2 x86_64
>            Reporter: Mitesh Singh Jat
>         Attachments: NUTCH-1640.patch
>
>
> The nutch ParseSegment phase fails after 2 runs on same TaskTracker, with the following Exception:
> {noformat}
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:640)
> 	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:553)
> 	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvmRunner(JvmManager.java:317)
> 	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvm(JvmManager.java:297)
> 	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.taskKilled(JvmManager.java:289)
> 	at org.apache.hadoop.mapred.JvmManager.taskKilled(JvmManager.java:158)
> 	at org.apache.hadoop.mapred.TaskRunner.kill(TaskRunner.java:802)
> 	at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:3315)
> 	at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.jobHasFinished(TaskTracker.java:3287)
> 	at org.apache.hadoop.mapred.TaskTracker.purgeTask(TaskTracker.java:2316)
> 	at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3710)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1444)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1440)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1438)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1118)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
> 	at $Proxy1.fatalError(Unknown Source)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:310)
> {noformat}
> Whereas similar parsing when done in Nutch Fetcher Phase (fetcher.parse=true, fetcher.store.content=false) does not give such issue.
> Hence, on analysing the code of Fetcher and ParseSegment, it seems the issue
> should be related to creation parseResult foreach url in ParseSegment.java.
> {code}
>  95     ParseResult parseResult = null;
>  96     try {
>  97       parseResult = new ParseUtil(getConf()).parse(content); // <*****
>  98     } catch (Exception e) {
>  99       LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
> 100       return;
> 101     }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira