Posted to user@nutch.apache.org by afan0804 <wi...@hotmail.com> on 2008/09/06 01:14:08 UTC

Nutch searcher keeps reading CVS directories

Hi All,

My problem occurs when this code is called:
Summary[] summaries = nbean.getSummary(details, query);
where nbean is a NutchBean, query is a Query object, and details is a
HitDetails[] array.

I get this message:
[9/5/08 16:37:07:203 MDT] 00000034 SystemErr     R 08/09/05 16:37:07 FATAL
searcher.FetchedSegments: java.io.FileNotFoundException: C:/[path to crawl
folder]/segments/20080828123423/parse_text/CVS/data

Since this code is checked into CVS, each directory level contains an
auto-generated CVS directory.  My guess is that Nutch is treating those CVS
directories as part of the segment and looking for the "data" file, which
does not exist in a CVS directory.

I would like to ignore those CVS directories instead of removing them (since
they are needed by CVS).

It seems that the path to the segment sub-directory is processed in:
org.apache.nutch.searcher.FetchedSegments
    private MapFile.Reader[] getReaders(String subDir) throws IOException {
      return MapFileOutputFormat.getReaders(fs,
          new Path(segmentDir, subDir), this.conf);
    }
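
Paraphrasing the Hadoop source from memory (so treat this as a sketch, not
the exact code), getReaders seems to list every child of the given directory
and open each one as a MapFile, which would explain the CVS/data error:

    // Simplified paraphrase of MapFileOutputFormat.getReaders; imports:
    // java.io.IOException, java.util.Arrays,
    // org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.*,
    // org.apache.hadoop.io.MapFile
    public static MapFile.Reader[] getReaders(FileSystem fs, Path dir,
        Configuration conf) throws IOException {
      // Every child of dir, including a CVS directory, becomes a "part"
      Path[] names = FileUtil.stat2Paths(fs.listStatus(dir));
      Arrays.sort(names);  // keep parts in partition order
      MapFile.Reader[] parts = new MapFile.Reader[names.length];
      for (int i = 0; i < names.length; i++) {
        // MapFile.Reader opens <part>/data and <part>/index, hence the
        // FileNotFoundException on .../CVS/data
        parts[i] = new MapFile.Reader(fs, names[i].toString(), conf);
      }
      return parts;
    }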

I have tried passing in C:/[path to crawl
folder]/segments/20080828123423/parse_text/part-00000 instead, but then the
error becomes:
[9/5/08 14:34:08:453 MDT] 0000002a SystemErr     R 08/09/05 14:34:08 FATAL
searcher.FetchedSegments: java.io.FileNotFoundException: C:/[path to crawl
folder]/segments/20080828123423/parse_text/part-00000/CVS/data

Any ideas?  Is it possible to get Hadoop to ignore directories named "CVS"? 
Or is there a way I can point directly to the data file?
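
(If I read getReaders right, it also lists the children of whatever
directory it is handed, so passing part-00000 just pushes the problem down
into that directory's own CVS subdirectory.  Presumably a single part could
be opened directly with MapFile.Reader by pointing it at the part directory
holding the data and index files, rather than at the data file itself:

    // Hypothetical workaround: open one part directly, bypassing the
    // directory listing.  The reader expects the part directory itself.
    MapFile.Reader reader = new MapFile.Reader(fs,
        new Path(segmentDir, "parse_text/part-00000").toString(), conf);

but that would mean redoing the part lookup that getReaders handles.)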

Thank you very much,
Angela Fan


Re: Nutch searcher keeps reading CVS directories

Posted by afan0804 <wi...@hotmail.com>.
Alright I shall try that.  Thank you very much for your help!

Angela





Re: Nutch searcher keeps reading CVS directories

Posted by Dennis Kubes <ku...@apache.org>.
It looks like your segment data is in CVS as well?  Is that what you
really want?  Maybe so; I guess it depends on the project.  The error,
though, is a tricky one, as you would have to change Hadoop code,
specifically the MapFileOutputFormat.getReaders method, to use
listStatus(ArrayList<FileStatus> results, Path f, PathFilter filter)
instead of the current fs.listStatus(dir).  So it is doable, but difficult.
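
Roughly, the change might look like the following.  I am sketching from
memory against the public listStatus(Path, PathFilter) overload rather than
the private ArrayList variant, so consider it untested:

    // Untested sketch: a filtered getReaders that skips CVS directories.
    // Imports: java.io.IOException, java.util.Arrays,
    // org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.*,
    // org.apache.hadoop.io.MapFile
    public static MapFile.Reader[] getReaders(FileSystem fs, Path dir,
        Configuration conf) throws IOException {
      // List the parts, ignoring any directory named CVS
      FileStatus[] stats = fs.listStatus(dir, new PathFilter() {
        public boolean accept(Path path) {
          return !"CVS".equals(path.getName());
        }
      });
      Path[] names = FileUtil.stat2Paths(stats);
      Arrays.sort(names);  // keep parts in partition order
      MapFile.Reader[] parts = new MapFile.Reader[names.length];
      for (int i = 0; i < names.length; i++) {
        parts[i] = new MapFile.Reader(fs, names[i].toString(), conf);
      }
      return parts;
    }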

Dennis
