Posted to user@nutch.apache.org by Ensheng Wang <nu...@yahoo.com.cn> on 2006/05/10 19:50:47 UTC

dedup error, help me!

I got an error when I ran nutch dedup. Please help!
  My segments total about 20 GB. Is that too big, or is some of the data bad?
  I don't know which it is.
  Thanks a lot!
   
   
  [wangensh@pc110 crawl]$ nutch dedup segments/ tmp
060511 013951 parsing file:/home/wangensh/nutch-0.7.2/conf/nutch-default.xml
060511 013951 parsing file:/home/wangensh/nutch-0.7.2/conf/nutch-site.xml
060511 013951 No FS indicated, using default:local
060511 013951 Clearing old deletions in segments/20060507135947/index(segments/20060507135947/index)
060511 013951 Clearing old deletions in segments/20060501150328/index(segments/20060501150328/index)
060511 013951 Clearing old deletions in segments/20060501232333/index(segments/20060501232333/index)
060511 013951 Clearing old deletions in segments/20060502161811/index(segments/20060502161811/index)
060511 013951 Clearing old deletions in segments/20060427204139/index(segments/20060427204139/index)
060511 013951 Clearing old deletions in segments/20060502215251/index(segments/20060502215251/index)
060511 013951 Clearing old deletions in segments/20060428074316/index(segments/20060428074316/index)
060511 013951 Clearing old deletions in segments/20060428153029/index(segments/20060428153029/index)
060511 013951 Clearing old deletions in segments/20060428235858/index(segments/20060428235858/index)
060511 013951 Clearing old deletions in segments/20060429051429/index(segments/20060429051429/index)
060511 013951 Clearing old deletions in segments/20060503043601/index(segments/20060503043601/index)
060511 013951 Clearing old deletions in segments/20060429113057/index(segments/20060429113057/index)
060511 013951 Clearing old deletions in segments/20060429180029/index(segments/20060429180029/index)
060511 013951 Clearing old deletions in segments/20060430010104/index(segments/20060430010104/index)
060511 013951 Clearing old deletions in segments/20060430055919/index(segments/20060430055919/index)
060511 013951 Clearing old deletions in segments/20060430111242/index(segments/20060430111242/index)
060511 013951 Clearing old deletions in segments/20060430201343/index(segments/20060430201343/index)
060511 013951 Clearing old deletions in segments/20060501025132/index(segments/20060501025132/index)
060511 013951 Clearing old deletions in segments/20060503125346/index(segments/20060503125346/index)
060511 013951 Clearing old deletions in segments/20060503185355/index(segments/20060503185355/index)
060511 013951 Clearing old deletions in segments/20060504001824/index(segments/20060504001824/index)
060511 013951 Clearing old deletions in segments/20060504091608/index(segments/20060504091608/index)
060511 013951 Clearing old deletions in segments/20060504174715/index(segments/20060504174715/index)
060511 013951 Clearing old deletions in segments/20060505012951/index(segments/20060505012951/index)
060511 013951 Clearing old deletions in segments/20060505110206/index(segments/20060505110206/index)
060511 013951 Clearing old deletions in segments/20060505171002/index(segments/20060505171002/index)
060511 013951 Clearing old deletions in segments/20060506001003/index(segments/20060506001003/index)
060511 013951 Clearing old deletions in segments/20060507144825/index(segments/20060507144825/index)
060511 013951 Reading url hashes...
060511 014026 Sorting url hashes...
060511 014032 Deleting url duplicates...
060511 014033 Deleted 147805 url duplicates.
060511 014033 Reading content hashes...
Exception in thread "Main Thread" java.lang.RuntimeException: Not a hex character: g
        at org.apache.nutch.io.MD5Hash.charToNibble(MD5Hash.java:194)
        at org.apache.nutch.io.MD5Hash.setDigest(MD5Hash.java:180)
        at org.apache.nutch.indexer.DeleteDuplicates$1.updateHash(DeleteDuplicates.java:163)
        at org.apache.nutch.indexer.DeleteDuplicates.computeHashes(DeleteDuplicates.java:226)
        at org.apache.nutch.indexer.DeleteDuplicates.deleteContentDuplicates(DeleteDuplicates.java:160)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:350)
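
The message "Not a hex character: g" suggests that a stored content hash contains a character outside 0-9 and a-f, which an MD5 digest written as hex should never have. Below is a minimal sketch in plain Java of that kind of check. It is not Nutch code: the class name HexDigestCheck and its methods are hypothetical, and it assumes the digest is stored as a 32-character hex string.

```java
public class HexDigestCheck {

    // Returns the value 0-15 of a hex character, or -1 if it is not one.
    public static int nibble(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return 10 + (c - 'a');
        if (c >= 'A' && c <= 'F') return 10 + (c - 'A');
        return -1;
    }

    // Returns the index of the first non-hex character, or -1 if every
    // character is a valid hex digit.
    public static int firstBadIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (nibble(s.charAt(i)) < 0) return i;
        }
        return -1;
    }

    // Assumption: an MD5 digest in hex form is exactly 32 hex characters.
    public static boolean isMd5Hex(String s) {
        return s != null && s.length() == 32 && firstBadIndex(s) < 0;
    }

    public static void main(String[] args) {
        // Check each command-line argument and report whether it looks
        // like a well-formed hex MD5 digest.
        for (String a : args) {
            System.out.println(a + " -> " + (isMd5Hex(a) ? "ok" : "bad"));
        }
    }
}
```

Running it over the hash values read from each segment index would show which segment holds the malformed value, so the problem would point to bad data rather than to the 20 GB size.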


		