Posted to dev@nutch.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2010/07/14 22:50:54 UTC
[jira] Commented: (NUTCH-677) Segment merge filtering based on segment content
[ https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888530#action_12888530 ]
Chris A. Mattmann commented on NUTCH-677:
-----------------------------------------
Hi Marcin,
I applied your patch, and was unit testing it, all ready to commit, when I ran into this:
{noformat}
[junit] Test org.apache.nutch.segment.TestSegmentMerger FAILED
[junit] Running org.apache.nutch.util.TestEncodingDetector
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.408 sec
[junit] Running org.apache.nutch.util.TestGZIPUtils
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.521 sec
[junit] Running org.apache.nutch.util.TestNodeWalker
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.593 sec
[junit] Running org.apache.nutch.util.TestPrefixStringMatcher
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.452 sec
[junit] Running org.apache.nutch.util.TestStringUtil
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.076 sec
[junit] Running org.apache.nutch.util.TestSuffixStringMatcher
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.321 sec
[junit] Running org.apache.nutch.util.TestURLUtil
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 2.009 sec
BUILD FAILED
/Users/mattmann/src/nutch/build.xml:258: Tests failed!
Total time: 8 minutes 44 seconds
[chipotle:~/src/nutch] mattmann%
{noformat}
The root cause of the SegmentMerger test error (from build/test/TEST-org.apache.nutch.segment.TestSegmentMerger.txt) is:
{noformat}
432
2010-07-14 13:45:33,085 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(276)) - file:/tmp/hadoop-mattmann/merge-1279140109299/seg1/parse_text/part-00000/data:0+3355
4432
2010-07-14 13:45:33,445 INFO mapred.MapTask (MapTask.java:flush(1115)) - Starting flush of map output
2010-07-14 13:45:35,101 INFO mapred.MapTask (MapTask.java:sortAndSpill(1295)) - Finished spill 2
2010-07-14 13:45:35,107 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(256)) - job_local_0001
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill0.out in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1443)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2010-07-14 13:45:35,879 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1343)) - Job complete: job_local_0001
2010-07-14 13:45:35,883 INFO mapred.JobClient (Counters.java:log(514)) - Counters: 9
2010-07-14 13:45:35,884 INFO mapred.JobClient (Counters.java:log(516)) - FileSystemCounters
2010-07-14 13:45:35,884 INFO mapred.JobClient (Counters.java:log(518)) - FILE_BYTES_READ=68360507
2010-07-14 13:45:35,885 INFO mapred.JobClient (Counters.java:log(518)) - FILE_BYTES_WRITTEN=229824559
2010-07-14 13:45:35,885 INFO mapred.JobClient (Counters.java:log(516)) - Map-Reduce Framework
2010-07-14 13:45:35,885 INFO mapred.JobClient (Counters.java:log(518)) - Combine output records=0
2010-07-14 13:45:35,886 INFO mapred.JobClient (Counters.java:log(518)) - Map input records=703319
2010-07-14 13:45:35,886 INFO mapred.JobClient (Counters.java:log(518)) - Spilled Records=524287
2010-07-14 13:45:35,887 INFO mapred.JobClient (Counters.java:log(518)) - Map output bytes=42791349
2010-07-14 13:45:35,888 INFO mapred.JobClient (Counters.java:log(518)) - Map input bytes=0
2010-07-14 13:45:35,888 INFO mapred.JobClient (Counters.java:log(518)) - Map output records=703319
2010-07-14 13:45:35,889 INFO mapred.JobClient (Counters.java:log(518)) - Combine input records=0
------------- ---------------- ---------------
------------- Standard Error -----------------
Creating large segment 1...
- done: 1677722 records.
Creating large segment 2...
- done: 1677722 records.
------------- ---------------- ---------------
Testcase: testLargeMerge took 227.804 sec
Caused an ERROR
Job failed!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:639)
at org.apache.nutch.segment.TestSegmentMerger.testLargeMerge(TestSegmentMerger.java:87)
{noformat}
Any ideas? I'd be happy to commit this, provided we can get it to pass the regression tests....
Cheers,
Chris
> Segment merge filtering based on segment content
> ------------------------------------------------
>
> Key: NUTCH-677
> URL: https://issues.apache.org/jira/browse/NUTCH-677
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Marcin Okraszewski
> Assignee: Chris A. Mattmann
> Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, SegmentMergeFilters.java
>
>
> I needed segment filtering based on metadata detected during the parse phase. Unfortunately, the current URL-based filtering does not allow for this. So I have created a new SegmentMergeFilter extension which receives the segment entry being merged and decides whether it should be included or not. Even though I needed only ParseData for my purpose, I have made it a bit more general-purpose, so the filter receives all of the merged data.
> The attached patch is for version 0.9, which I use. Unfortunately I didn't have time to check how it fits the trunk version. Sorry :(
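For readers following along, the extension point described in the quoted issue can be sketched in a stripped-down form. Everything below is illustrative only: the interface name echoes the attached SegmentMergeFilter.java, but the signature and the LanguageFilter example are simplified stand-ins. The real filter presumably receives the full set of merged segment parts (CrawlDatum, Content, ParseData, ParseText, ...); here a plain metadata map stands in for ParseData so the sketch is self-contained.

```java
import java.util.Map;

// Illustrative sketch only: a stripped-down analogue of the
// SegmentMergeFilter extension point from the attached patch.
// A plain metadata map stands in for ParseData metadata.
interface SegmentMergeFilter {
    /** Return true to keep the entry in the merged segment. */
    boolean filter(String url, Map<String, String> parseMeta);
}

// Hypothetical filter: keep only pages whose parse-time language
// metadata matches the wanted language.
class LanguageFilter implements SegmentMergeFilter {
    private final String wantedLang;

    LanguageFilter(String wantedLang) {
        this.wantedLang = wantedLang;
    }

    @Override
    public boolean filter(String url, Map<String, String> parseMeta) {
        return wantedLang.equals(parseMeta.get("lang"));
    }
}

public class FilterDemo {
    public static void main(String[] args) {
        SegmentMergeFilter f = new LanguageFilter("en");
        System.out.println(f.filter("http://example.com/a", Map.of("lang", "en")));
        System.out.println(f.filter("http://example.com/b", Map.of("lang", "de")));
    }
}
```

The point of the design is that the merger consults each registered filter per entry during the merge, so content-based decisions (language, MIME type, parse metadata) happen without a separate pass over the segments.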
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.