Posted to user@nutch.apache.org by Phạm Hải Thanh <ph...@vasc.com.vn> on 2007/06/21 11:49:57 UTC
Problem with merge-output
Hi all,
After recrawling several times, I have a problem with the merge-output directory. I dug into the mail archive and found a clue: you should use a new directory name for the new merge, e.g., merge-output_new, then mv merge-output_new to merge-output.
Can anyone show me exactly how to do this?
Thanks a lot
============================================================================
After refetching the database, I get the following error during index merging.
2007-04-27 15:58:37,787 FATAL indexer.IndexMerger - IndexMerger:
java.io.IOException: Target /usr/local/nutch/nutchdb/index/merge-output already exists
at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:230)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:70)
at org.apache.hadoop.fs.LocalFileSystem.copyFromLocalFile(LocalFileSystem.java:49)
at org.apache.hadoop.fs.FileSystem.moveFromLocalFile(FileSystem.java:750)
at org.apache.hadoop.fs.ChecksumFileSystem.completeLocalOutput(ChecksumFileSystem.java:622)
at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:104)
at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:150)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:113)
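The hint from the archive (write the new merge under a fresh name such as merge-output_new, then mv it over merge-output) could be sketched roughly as below. This is only an illustration, not a confirmed fix: the function name promote_dir and the ".old" backup suffix are hypothetical, and it assumes the merge step of the recrawl script has already been pointed at the new directory name.

```shell
#!/bin/sh
# Sketch only: promote a freshly written directory into the place of an old one.
# promote_dir NEW LIVE   (hypothetical helper, not part of Nutch)
#   NEW  - directory the merge step just wrote (e.g. merge-output_new)
#   LIVE - directory name the rest of the crawl expects (e.g. merge-output)
promote_dir() {
    new=$1
    live=$2
    old="${live}.old"   # assumed backup name

    # Refuse to continue if the new directory was never produced.
    [ -d "$new" ] || { echo "promote_dir: $new does not exist" >&2; return 1; }

    # Drop the backup from an earlier run, then keep the current copy as the backup.
    rm -rf "$old"
    if [ -e "$live" ]; then
        mv "$live" "$old" || return 1
    fi

    # Move the new directory into place; restore the backup if that fails.
    if ! mv "$new" "$live"; then
        [ -e "$old" ] && mv "$old" "$live"
        return 1
    fi
}
```

A recrawl script might then call something like promote_dir "$index_dir/merge-output_new" "$index_dir/merge-output" after a successful merge, where index_dir is a placeholder for wherever the index lives. Keeping the previous directory as a backup, instead of deleting it, leaves a way back if the new merge turns out to be bad.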
RE: Problem with merge-output
Posted by Phạm Hải Thanh <ph...@vasc.com.vn>.
Hi Pal, hi all,
The merge-output directory appears while merging indexes, not while merging segments. I attach the recrawl script for clarity.
The first several recrawls work without problems, but after that the merge-output directory appears and the merge fails with
...merge-output already exists...
Does anyone know exactly how to resolve this?
Thanks a lot,
-----Original Message-----
From: Susam Pal [mailto:susam.pal@gmail.com]
Sent: 21 June 2007 5:00 PM
To: nutch-user@lucene.apache.org
Subject: Re: Problem with merge-output
This is something I usually do:-
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -rf crawl/segments/*
mv crawl/MERGEDsegments/* crawl/segments
You might want to replace the second statement with a 'mv' statement
to back up the segments.
Regards,
Susam Pal
http://susam.in/
Re: Problem with merge-output
Posted by Susam Pal <su...@gmail.com>.
This is something I usually do:-
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -rf crawl/segments/*
mv crawl/MERGEDsegments/* crawl/segments
You might want to replace the second statement with a 'mv' statement
to back up the segments.
Regards,
Susam Pal
http://susam.in/
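The backup variant suggested above (mv in place of rm -rf) might look something like the sketch below. The function name replace_segments and the backup directory argument are hypothetical; the sketch assumes mergesegs has already written its output and that the caller passes a fresh backup directory each time.

```shell
#!/bin/sh
# Sketch only: swap merged segments in, keeping the old ones as a backup.
# replace_segments MERGED SEGMENTS BACKUP   (hypothetical helper, not part of Nutch)
replace_segments() {
    merged=$1
    segments=$2
    backup=$3

    # Do nothing unless the merge step actually produced output.
    [ -d "$merged" ] && [ -n "$(ls -A "$merged")" ] || {
        echo "replace_segments: nothing in $merged" >&2
        return 1
    }

    # Back up the old segments with mv instead of deleting them with rm -rf.
    mkdir -p "$backup" "$segments" || return 1
    if [ -n "$(ls -A "$segments")" ]; then
        mv "$segments"/* "$backup"/ || return 1
    fi

    # Move the merged segments into the now-empty segments directory.
    mv "$merged"/* "$segments"/
}
```

With the directory names used above, a call could be replace_segments crawl/MERGEDsegments crawl/segments crawl/segments.bak, where crawl/segments.bak is an assumed name. The check at the top means a failed mergesegs run leaves the existing segments untouched.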
Re: Problem with merge-output
Posted by chris sleeman <ch...@gmail.com>.
My apologies. The dedup error was "mea culpa": I was using an older
version of the Lucene jar.
However, I am still getting the
"IndexMerger: java.io.IOException: Target crawl-test/index/merge-output
already exists" exception.
Once a crawl has completed successfully, can I simply delete the merge-output
dir before my next crawl, or is merge-output used elsewhere?
Regards,
Chris
On 7/9/07, chris sleeman <ch...@gmail.com> wrote:
>
> Hi,
> I am also facing this same problem. Have you figured out a solution to
> this yet?
>
> Also, I keep getting the following error every time I recrawl:
>
> DeleteDuplicates: java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:491)
> at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:515)
> at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
> at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:499)
>
> Can anyone please help me out with these problems?
>
> Thanks,
> -Chris
Re: Problem with merge-output
Posted by chris sleeman <ch...@gmail.com>.
Hi,
I am also facing this same problem. Have you figured out a solution to this
yet?
Also, I keep getting the following error every time I recrawl:
DeleteDuplicates: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:491)
at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:515)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:499)
Can anyone please help me out with these problems?
Thanks,
-Chris