Posted to user@nutch.apache.org by Phạm Hải Thanh <ph...@vasc.com.vn> on 2007/06/21 11:49:57 UTC

Problem with merge-output

Hi all,

After recrawling several times, I have a problem with the merge-output directory. I dug through the mail archive and found a clue: use a new directory name for the new merge, e.g., merge-output_new, then mv merge-output_new to merge-output.

Can anyone show me exactly how to do this?

Thanks a lot

============================================================================

After refetching the database, I get the following error during index merging.

2007-04-27 15:58:37,787 FATAL indexer.IndexMerger - IndexMerger:
java.io.IOException: Target /usr/local/nutch/nutchdb/index/merge-output already exists
        at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:230)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:70)
        at org.apache.hadoop.fs.LocalFileSystem.copyFromLocalFile(LocalFileSystem.java:49)
        at org.apache.hadoop.fs.FileSystem.moveFromLocalFile(FileSystem.java:750)
        at org.apache.hadoop.fs.ChecksumFileSystem.completeLocalOutput(ChecksumFileSystem.java:622)
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:104)
        at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:150)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:113)


RE: Problem with merge-output

Posted by Phạm Hải Thanh <ph...@vasc.com.vn>.
Hi Pal, hi all,
The merge-output directory appears while merging indexes, not while merging segments. I have attached the recrawl script for clarity.
The first several recrawls work without problems, but after that the merge-output dir appears and it says

	...merge-output already exists...

Does anyone know exactly how to resolve this?
Thanks a lot,

-----Original Message-----
From: Susam Pal [mailto:susam.pal@gmail.com] 
Sent: 21 June 2007 5:00 PM
To: nutch-user@lucene.apache.org
Subject: Re: Problem with merge-output

This is something I usually do:-

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -rf crawl/segments/*
mv crawl/MERGEDsegments/* crawl/segments

You might want to replace the second statement with a 'mv' statement
to back up the segments instead of deleting them.

Regards,
Susam Pal
http://susam.in/


Re: Problem with merge-output

Posted by Susam Pal <su...@gmail.com>.
This is something I usually do:-

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -rf crawl/segments/*
mv crawl/MERGEDsegments/* crawl/segments

You might want to replace the second statement with a 'mv' statement
to back up the segments instead of deleting them.

Regards,
Susam Pal
http://susam.in/
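
The backup variant suggested above could be sketched as follows. This is only an illustration: the function name, the NUTCH_BIN variable, and the timestamped segments_backup_* directory are assumptions made here, not part of Nutch or of the original commands.

```shell
#!/bin/sh
# Sketch: merge all segments, keeping the old ones as a backup instead of
# deleting them. NUTCH_BIN and the backup directory name are assumptions.

merge_segments_with_backup() {
    crawl_dir=$1                                   # e.g. "crawl"
    nutch_bin=${NUTCH_BIN:-$NUTCH_HOME/bin/nutch}  # path to the nutch launcher
    backup_dir="$crawl_dir/segments_backup_$(date +%Y%m%d%H%M%S)"

    # 1. Merge every existing segment into a fresh directory.
    "$nutch_bin" mergesegs "$crawl_dir/MERGEDsegments" "$crawl_dir"/segments/* || return 1

    # 2. Move (rather than delete) the old segments out of the way.
    mv "$crawl_dir/segments" "$backup_dir" || return 1

    # 3. Put the merged segment(s) in place and drop the empty staging dir.
    mkdir "$crawl_dir/segments" || return 1
    mv "$crawl_dir"/MERGEDsegments/* "$crawl_dir/segments"/ || return 1
    rmdir "$crawl_dir/MERGEDsegments"
}
```

It would be called as `merge_segments_with_backup crawl`; the backup directory can be removed by hand once the merged segments have been checked.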


Re: Problem with merge-output

Posted by chris sleeman <ch...@gmail.com>.
My apologies. The dedup error was my own fault: I was using an older
version of the lucene jar.

However, I am still getting the
"IndexMerger: java.io.IOException: Target crawl-test/index/merge-output
already exists" exception.

Once a crawl is completed successfully, can I simply delete the merge-output
dir for my next crawl, or is merge-output used elsewhere?

Regards,
Chris
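
One way to apply the "merge into a new directory, then mv" hint from the first message is sketched below. It assumes that the recrawl script invokes IndexMerger as `bin/nutch merge <outputIndex> <indexesDir>` and that the per-segment indexes live in crawl/indexes; the function name, NUTCH_BIN, index_new, and index_old_* are hypothetical names chosen for illustration. The thread does not confirm whether merge-output is read elsewhere, so the sketch moves the previous index aside rather than deleting it.

```shell
#!/bin/sh
# Sketch: build the merged index under a fresh name, then swap it into place.
# NUTCH_BIN and the directory names (indexes, index_new, index_old_*) are
# assumptions, not something the thread or Nutch prescribes.

merge_index_and_swap() {
    crawl_dir=$1
    nutch_bin=${NUTCH_BIN:-$NUTCH_HOME/bin/nutch}
    new_index="$crawl_dir/index_new"
    old_index="$crawl_dir/index_old_$(date +%Y%m%d%H%M%S)"

    # Refuse to start if a leftover target exists, since an existing
    # target is what the "already exists" exception complains about.
    if [ -e "$new_index" ]; then
        echo "leftover $new_index found; remove it first" >&2
        return 1
    fi

    # Merge the per-segment indexes into a directory that does not exist yet.
    "$nutch_bin" merge "$new_index" "$crawl_dir/indexes" || return 1

    # Swap: move the previous index aside, then move the new one into place.
    if [ -d "$crawl_dir/index" ]; then
        mv "$crawl_dir/index" "$old_index" || return 1
    fi
    mv "$new_index" "$crawl_dir/index"
}
```

It would be called as `merge_index_and_swap crawl-test` after each recrawl. The old index is kept under index_old_* so it can be restored if the new one turns out to be broken.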


On 7/9/07, chris sleeman <ch...@gmail.com> wrote:
>
> Hi,
> I am also facing this same problem. Have you figured out a solution to
> this yet?
>
> Also, I keep getting the following error every time I recrawl:
>
> DeleteDuplicates: java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:491)
> at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:515)
> at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
> at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:499)
>
> Can anyone please help me out with these problems?
>
> Thanks,
> -Chris

Re: Problem with merge-output

Posted by chris sleeman <ch...@gmail.com>.
Hi,
I am also facing this same problem. Have you figured out a solution to this
yet?

Also, I keep getting the following error every time I recrawl:

DeleteDuplicates: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:491)
at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:515)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:499)

Can anyone please help me out with these problems?

Thanks,
-Chris



