Posted to user@nutch.apache.org by Sudip Datta <pi...@gmail.com> on 2012/03/16 20:46:30 UTC

Running CrawlDbReader: _SUCCESS/data does not exist

Hi,

I am using Nutch 1.4 and am trying to run CrawlDbReader (with the -url argument)
to find the status of specific URLs. Run in isolation, the code crashes with
the following stack trace:

Exception in thread "main" java.io.FileNotFoundException: File
file:/mnt/babble1/data/babble_index/store_crawls/crawldb/current/_SUCCESS/data
does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:676)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:302)
    at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:284)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:273)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:260)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:253)
    at org.apache.hadoop.mapred.MapFileOutputFormat.getReaders(MapFileOutputFormat.java:93)
    at org.apache.nutch.crawl.CrawlDbReader.openReaders(CrawlDbReader.java:81)
    at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:382)
    at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
    at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
A _SUCCESS file does exist at crawldb/current/. I am trying to run
CrawlDbReader so that I can eventually remove some URLs from the Solr index
according to criteria available in the crawldb.
Any clues about what might be going wrong would be helpful.

Thanks,

--Sudip.

Re: Running CrawlDbReader: _SUCCESS/data does not exist

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Mmmm, there are some issues logged in Jira for similar (maybe identical)
problems.

If you search for the keywords SUCCESS and nutch in the Jira search box, it'll
pull them up for you.

Lewis




-- 
Lewis

Re: Running CrawlDbReader: _SUCCESS/data does not exist

Posted by Adriana Farina <ad...@gmail.com>.
I have the same doubt, since _SUCCESS is a file automatically generated by Nutch. I suppose it's something Nutch creates when it successfully writes each segment, but I may well be wrong; that sounds too naive.



Sent from my iPhone


Re: Running CrawlDbReader: _SUCCESS/data does not exist

Posted by Sudip Datta <pi...@gmail.com>.
Yes, it does seem to work, though I wonder whether it's a bug or whether
that's how it is supposed to behave.

Thanks.
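[Editor's note: the behavior is consistent with how Hadoop's MapFileOutputFormat.getReaders() reads the directory: it appears to treat every entry under crawldb/current/ as a MapFile directory and opens <entry>/data, so the empty _SUCCESS marker left by the job committer turns into a lookup of _SUCCESS/data, which fails exactly as in the stack trace above. Below is a minimal plain-Java sketch of that failure mode and of the kind of name filter that avoids it; the class and helper names are hypothetical, and plain java.io stands in for the Hadoop FileSystem API.]

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: MapFileOutputFormat.getReaders() enumerates every child of
// crawldb/current/ and tries to open <child>/data as a MapFile part.
// _SUCCESS is a plain empty marker file, so <dir>/_SUCCESS/data does not
// exist and the open throws FileNotFoundException. Skipping side files
// whose names start with "_" (e.g. _SUCCESS, _logs) avoids the crash.
public class SuccessFilterSketch {

    // Hypothetical helper: return only entries that look like MapFile parts.
    static List<String> mapFileParts(File currentDir) {
        List<String> parts = new ArrayList<>();
        File[] entries = currentDir.listFiles();
        if (entries == null) {
            return parts;
        }
        for (File entry : entries) {
            // Skip Hadoop side files such as _SUCCESS and _logs.
            if (entry.getName().startsWith("_")) {
                continue;
            }
            parts.add(entry.getName());
        }
        return parts;
    }

    public static void main(String[] args) throws Exception {
        // Build a throwaway layout mimicking crawldb/current/.
        File dir = new File(System.getProperty("java.io.tmpdir"), "crawldb-current-demo");
        new File(dir, "part-00000").mkdirs();
        new File(dir, "_SUCCESS").createNewFile();
        // Only the part-00000 entry survives the filter.
        System.out.println(mapFileParts(dir));
    }
}
```

This is why deleting _SUCCESS (or filtering underscore-prefixed names, as later versions reportedly do) makes the reader work again.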


Re: Running CrawlDbReader: _SUCCESS/data does not exist

Posted by Adriana Farina <ad...@gmail.com>.
Hi.

I had the same problem using Nutch 1.3. I solved it by removing _SUCCESS before running the CrawlDbReader.
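
[Editor's note: concretely, the workaround might look like the sketch below. The crawldb path and URL are placeholders; `bin/nutch readdb <crawldb> -url <url>` is the CLI entry point for CrawlDbReader in Nutch 1.x.]

```shell
# Hypothetical local crawl layout; adjust the path to your setup.
CRAWLDB=/path/to/crawl/crawldb

# Remove the empty _SUCCESS marker so the MapFile readers
# do not attempt to open _SUCCESS/data.
rm -f "$CRAWLDB/current/_SUCCESS"

# Then query the status of a single URL via CrawlDbReader.
bin/nutch readdb "$CRAWLDB" -url http://www.example.com/
```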



Sent from my iPhone
