Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/07/14 22:50:54 UTC

[jira] Commented: (NUTCH-677) Segment merge filtering based on segment content

    [ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888530#action_12888530 ] 

Chris A. Mattmann commented on NUTCH-677:
-----------------------------------------

Hi Marcin,

I applied your patch and was unit testing it, all ready to commit, when I ran into this:

{noformat}
    [junit] Test org.apache.nutch.segment.TestSegmentMerger FAILED
    [junit] Running org.apache.nutch.util.TestEncodingDetector
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.408 sec
    [junit] Running org.apache.nutch.util.TestGZIPUtils
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.521 sec
    [junit] Running org.apache.nutch.util.TestNodeWalker
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.593 sec
    [junit] Running org.apache.nutch.util.TestPrefixStringMatcher
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.452 sec
    [junit] Running org.apache.nutch.util.TestStringUtil
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.076 sec
    [junit] Running org.apache.nutch.util.TestSuffixStringMatcher
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.321 sec
    [junit] Running org.apache.nutch.util.TestURLUtil
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.009 sec

BUILD FAILED
/Users/mattmann/src/nutch/build.xml:258: Tests failed!

Total time: 8 minutes 44 seconds
[chipotle:~/src/nutch] mattmann%
{noformat}

The root cause of the SegmentMerger test error (from build/test/TEST-org.apache.nutch.segment.TestSegmentMerger.txt) is:

{noformat}
2010-07-14 13:45:33,085 INFO  mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(276)) - file:/tmp/hadoop-mattmann/merge-1279140109299/seg1/parse_text/part-00000/data:0+33554432
2010-07-14 13:45:33,445 INFO  mapred.MapTask (MapTask.java:flush(1115)) - Starting flush of map output
2010-07-14 13:45:35,101 INFO  mapred.MapTask (MapTask.java:sortAndSpill(1295)) - Finished spill 2
2010-07-14 13:45:35,107 WARN  mapred.LocalJobRunner (LocalJobRunner.java:run(256)) - job_local_0001
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1443)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2010-07-14 13:45:35,879 INFO  mapred.JobClient (JobClient.java:monitorAndPrintJob(1343)) - Job complete: job_local_0001
2010-07-14 13:45:35,883 INFO  mapred.JobClient (Counters.java:log(514)) - Counters: 9
2010-07-14 13:45:35,884 INFO  mapred.JobClient (Counters.java:log(516)) -   FileSystemCounters
2010-07-14 13:45:35,884 INFO  mapred.JobClient (Counters.java:log(518)) -     FILE_BYTES_READ=68360507
2010-07-14 13:45:35,885 INFO  mapred.JobClient (Counters.java:log(518)) -     FILE_BYTES_WRITTEN=229824559
2010-07-14 13:45:35,885 INFO  mapred.JobClient (Counters.java:log(516)) -   Map-Reduce Framework
2010-07-14 13:45:35,885 INFO  mapred.JobClient (Counters.java:log(518)) -     Combine output records=0
2010-07-14 13:45:35,886 INFO  mapred.JobClient (Counters.java:log(518)) -     Map input records=703319
2010-07-14 13:45:35,886 INFO  mapred.JobClient (Counters.java:log(518)) -     Spilled Records=524287
2010-07-14 13:45:35,887 INFO  mapred.JobClient (Counters.java:log(518)) -     Map output bytes=42791349
2010-07-14 13:45:35,888 INFO  mapred.JobClient (Counters.java:log(518)) -     Map input bytes=0
2010-07-14 13:45:35,888 INFO  mapred.JobClient (Counters.java:log(518)) -     Map output records=703319
2010-07-14 13:45:35,889 INFO  mapred.JobClient (Counters.java:log(518)) -     Combine input records=0
------------- ---------------- ---------------
------------- Standard Error -----------------
Creating large segment 1...
 - done: 1677722 records.
Creating large segment 2...
 - done: 1677722 records.
------------- ---------------- ---------------

Testcase: testLargeMerge took 227.804 sec
        Caused an ERROR
Job failed!
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:639)
        at org.apache.nutch.segment.TestSegmentMerger.testLargeMerge(TestSegmentMerger.java:87)
{noformat}
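
The DiskErrorException suggests LocalJobRunner is losing its map-output spill files under /tmp, which may be an environment issue (a full or periodically cleaned /tmp) rather than a bug in the patch itself. As a hedged sketch of one thing worth trying, the test's Configuration could point Hadoop's scratch space elsewhere before the merge runs; "hadoop.tmp.dir" and "mapred.local.dir" are standard Hadoop 0.20 keys, but the /scratch paths below are purely illustrative:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class SpillDirCheck {
  // Sketch only: redirect LocalJobRunner's scratch space away from /tmp in
  // case spill files are being lost there. The /scratch paths are illustrative.
  public static Configuration confWithScratchDirs() {
    Configuration conf = NutchConfiguration.create();
    conf.set("hadoop.tmp.dir", "/scratch/hadoop-tmp");
    conf.set("mapred.local.dir", "/scratch/hadoop-tmp/mapred-local");
    return conf;
  }
}
{noformat}

A Configuration built this way could then be handed to the SegmentMerger in TestSegmentMerger.testLargeMerge to rule /tmp in or out as the culprit.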

Any ideas? I'd be happy to commit this, provided we can get it to pass regression...

Cheers,
Chris
 


> Segment merge filtering based on segment content
> ------------------------------------------------
>
>                 Key: NUTCH-677
>                 URL: https://issues.apache.org/jira/browse/NUTCH-677
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Marcin Okraszewski
>            Assignee: Chris A. Mattmann
>         Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, SegmentMergeFilters.java
>
>
> I needed to filter segments based on metadata detected during the parse phase. Unfortunately, the current URL-based filtering does not allow for this. So I have created a new SegmentMergeFilter extension which receives the segment entry being merged and decides whether it should be included or not. Even though I needed only the ParseData for my purpose, I have made it a bit more general, so the filter receives all of the merged data (a sketch of such an interface follows below).
> The attached patch is for version 0.9, which I use. Unfortunately, I didn't have time to check how it fits the trunk version. Sorry :(
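
For reference, a minimal sketch of what such an extension point might look like; the exact method signature here is an assumption based on the description above ("the filter receives all merged data"), not necessarily what the attached patch uses:

{noformat}
import java.util.Collection;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;

// Hedged sketch of the extension point described above. Every part of the
// merged entry is passed in; returning false drops the entry from the merge.
public interface SegmentMergeFilter {
  boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
                 CrawlDatum sigData, Content content, ParseData parseData,
                 ParseText parseText, Collection<CrawlDatum> linked);
}

// Illustrative use case from the description: keep only entries whose parse
// metadata carries a marker. The "keep-in-merge" key is hypothetical.
class ParseMetaMergeFilter implements SegmentMergeFilter {
  public boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData,
                        CrawlDatum sigData, Content content, ParseData parseData,
                        ParseText parseText, Collection<CrawlDatum> linked) {
    if (parseData == null) return true;              // nothing to test: keep
    return parseData.getParseMeta().get("keep-in-merge") != null;
  }
}
{noformat}

Wired in as a plugin extension point, SegmentMerger would query each configured filter per entry and skip any entry a filter rejects.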

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.