Posted to user@nutch.apache.org by Alex Basa <al...@yahoo.com> on 2010/01/12 19:01:09 UTC

mergecrawls.sh

Does anyone know of any bug fixes to mergecrawls.sh?  I have two working indexes that I try to merge; the merge seems to run to completion, but the resulting index is corrupt.

before the merge, both indexes have
crawldb     index       linkdb      newindexes  segments

after the merge, the newindexes directory is gone
crawldb   index     linkdb    segments

I didn't log the output so I'll re-run it again and look at the output.
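
In case it matters, my understanding is that the script is basically the standard Nutch merge tools chained together, roughly like the outline below. These are placeholder paths (crawl1, crawl2, merged), not my real layout or the literal script contents:

# rough outline of the mergecrawls.sh sequence as I read it -- not verbatim
crawl1=crawl1
crawl2=crawl2
out=merged

bin/nutch mergedb     $out/crawldb  $crawl1/crawldb $crawl2/crawldb
bin/nutch mergesegs   $out/segments $crawl1/segments/* $crawl2/segments/*
bin/nutch mergelinkdb $out/linkdb   $crawl1/linkdb  $crawl2/linkdb
bin/nutch index       $out/newindexes $out/crawldb $out/linkdb $out/segments/*
bin/nutch dedup       $out/newindexes
bin/nutch merge       $out/index $out/newindexes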

Thanks,

Alex


      


Re: mergecrawls.sh

Posted by Alex Basa <al...@yahoo.com>.
It seems that when the indexer hits a 'Job failed', the script doesn't back up one directory, so in the next phase, when it runs the dedup, it can't find the newindexes directory because it's looking for it under index.  Does anyone know of a fix to Indexer for this?  I'm running Nutch 0.9
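
One workaround I'm considering (untested -- the $crawl variable and paths below are placeholders, not what the script actually uses) is to stop relying on the working directory at all: give the indexer and dedup absolute paths, and only run dedup/merge if the indexer actually produced a newindexes directory:

# placeholder path -- substitute the real merged crawl directory
crawl=/database/Nutch/merged_crawl

bin/nutch index $crawl/newindexes $crawl/crawldb $crawl/linkdb $crawl/segments/*

# only dedup/merge if indexing actually produced output; checking for the
# directory is safer than $? in case the 0.9 wrapper doesn't propagate the
# indexer's exit code
if [ ! -d "$crawl/newindexes" ]; then
  echo "Indexer failed -- leaving the existing index untouched" >&2
  exit 1
fi

bin/nutch dedup $crawl/newindexes
bin/nutch merge $crawl/index $crawl/newindexes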

As always, thanks in advance

 Indexing [http://www.plataformaarquitectura.cl/2009/06/28/summer-show-2009-barlett-school-of-architecture-ucl/100_7191/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@799e11a1 (null)
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:307)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:329)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:312)

log4j:ERROR Failed to flush writer,
java.io.InterruptedIOException
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
        at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
        at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:57)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:315)
        at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:358)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:159)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
        at org.apache.log4j.Category.callAppenders(Category.java:203)
        at org.apache.log4j.Category.forcedLog(Category.java:388)
        at org.apache.log4j.Category.log(Category.java:853)
        at org.apache.commons.logging.impl.Log4JLogger.warn(Log4JLogger.java:169)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:166)
De-duplicate indexes
Dedup: starting
Dedup: adding indexes in: /database/Nutch/index/newindexes
org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /database/Nutch/index/newindexes
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:603)
        at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:674)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:658)
DeleteDuplicates: java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /database/Nutch/index.uchi/newindexes
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:653)
        at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:674)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:658)

Merge indexes
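
For what it's worth, a quick way to confirm the merged crawldb itself survived the merge (independent of the broken index step) is the readdb stats tool; the path here is illustrative:

bin/nutch readdb /database/Nutch/index/crawldb -stats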


--- On Tue, 1/12/10, Alex Basa <al...@yahoo.com> wrote:

> From: Alex Basa <al...@yahoo.com>
> Subject: mergecrawls.sh
> To: nutch-user@lucene.apache.org
> Date: Tuesday, January 12, 2010, 12:01 PM
> Does anyone know of any bug fixes to
> mergecrawls.sh?  I have two working indexes that I try
> to merge and it seems to work but when it's done, the index
> is corrupt.
> 
> before the merge, both indexes have
> crawldb     index       linkdb      newindexes  segments
> 
> after the merge, the newindexes directory is gone
> crawldb   index     linkdb    segments
> 
> I didn't log the output so I'll re-run it again and look at
> the output.
> 
> Thanks,
> 
> Alex
> 
> 
>       
> 
>