Posted to user@nutch.apache.org by Alex Basa <al...@yahoo.com> on 2010/01/12 19:01:09 UTC
mergecrawls.sh
Does anyone know of any bug fixes to mergecrawls.sh? I have two working indexes that I'm trying to merge; the merge appears to succeed, but when it's done the index is corrupt.
Before the merge, both crawl directories contain:
crawldb  index  linkdb  newindexes  segments
After the merge, the newindexes directory is gone:
crawldb  index  linkdb  segments
I didn't log the output, so I'll re-run it and go over the output.
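Something like this should capture everything on the re-run (a hypothetical invocation; the placeholders stand in for whatever arguments the script actually takes):

    # Capture stdout and stderr from the merge run into a log file.
    ./mergecrawls.sh <output_crawl> <crawl1> <crawl2> 2>&1 | tee mergecrawls.log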
Thanks,
Alex
Re: mergecrawls.sh
Posted by Alex Basa <al...@yahoo.com>.
It seems that when the indexer hits a 'Job failed' error, the script doesn't back out one directory level, so in the next phase, the dedup, it can't find the newindexes directory because it's looking for it under index. Does anyone know of a fix to Indexer for this? I'm running Nutch 0.9.
As always, thanks in advance
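For what it's worth, a minimal guard around the index/dedup/merge steps might look like the sketch below. It assumes the 0.9 command names (bin/nutch index / dedup / merge) and borrows the paths from the log that follows; the variable name and layout are illustrative, not taken from the actual mergecrawls.sh:

    # Hypothetical excerpt: refuse to dedup/merge if indexing failed or
    # if newindexes was never created under the crawl root.
    crawl=/database/Nutch/index   # crawl root, as in the log below

    bin/nutch index "$crawl/newindexes" "$crawl/crawldb" "$crawl/linkdb" \
        "$crawl"/segments/*
    if [ $? -ne 0 ] || [ ! -d "$crawl/newindexes" ]; then
        echo "Indexer failed or $crawl/newindexes is missing; aborting." >&2
        exit 1
    fi

    bin/nutch dedup "$crawl/newindexes"
    bin/nutch merge "$crawl/index" "$crawl/newindexes"

That wouldn't cure the underlying 'Job failed', but it would stop the script from deduping and merging against a directory that doesn't exist.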
Indexing [http://www.plataformaarquitectura.cl/2009/06/28/summer-show-2009-barlett-school-of-architecture-ucl/100_7191/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@799e11a1 (null)
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:307)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:329)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:312)
log4j:ERROR Failed to flush writer,
java.io.InterruptedIOException
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
at org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:57)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:315)
at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:358)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:159)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
at org.apache.log4j.Category.callAppenders(Category.java:203)
at org.apache.log4j.Category.forcedLog(Category.java:388)
at org.apache.log4j.Category.log(Category.java:853)
at org.apache.commons.logging.impl.Log4JLogger.warn(Log4JLogger.java:169)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:166)
De-duplicate indexes
Dedup: starting
Dedup: adding indexes in: /database/Nutch/index/newindexes
org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /database/Nutch/index/newindexes
at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:603)
at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:674)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:658)
DeleteDuplicates: java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /database/Nutch/index.uchi/newindexes
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:653)
at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:674)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:658)
Merge indexes
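Worth noting: the script presses straight on to 'Merge indexes' even though the dedup step just failed, so the final merge runs against an incomplete set of indexes, which is the most plausible source of the corruption. Assuming mergecrawls.sh is a plain sh/bash script, a blunt stopgap is to make it bail out at the first failing step:

    # Hypothetical one-line addition near the top of mergecrawls.sh:
    set -e   # exit immediately if any step returns a non-zero status

With that in place the run would stop at the indexer's 'Job failed' instead of leaving a half-merged, corrupt index behind.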
--- On Tue, 1/12/10, Alex Basa <al...@yahoo.com> wrote:
> From: Alex Basa <al...@yahoo.com>
> Subject: mergecrawls.sh
> To: nutch-user@lucene.apache.org
> Date: Tuesday, January 12, 2010, 12:01 PM
> Does anyone know of any bug fixes to
> mergecrawls.sh? I have two working indexes that I try
> to merge and it seems to work but when it's done, the index
> is corrupt.
>
> before the merge, both indexes have
> crawldb index linkdb newindexes segments
>
> after the merge, the newindexes directory is gone
> crawldb index linkdb segments
>
> I didn't log the output so I'll re-run it again and look at
> the output.
>
> Thanks,
>
> Alex