Posted to dev@nutch.apache.org by Asitang Mishra <as...@usc.edu> on 2015/03/24 09:07:57 UTC

Issue related to fetcher.parse property

Hello,

I am facing the following problem. Please help.

With fetcher.parse enabled (no Tika installed):

-tries to parse after each fetch
-gets warnings, as parsing of the content is unsuccessful
-fetching finishes
-parsing step starts
-EXCEPTION: Exception in thread "main" java.io.IOException: Segment already parsed!
-crawler exits with value 1


With fetcher.parse disabled (no Tika installed):

-fetching finishes
-parsing step starts
-gets warnings, as parsing of the content is unsuccessful
-crawl completes with no exceptions


I looked into why this exception is raised. It is thrown from:

(class) ParseOutputFormat -->
(function) checkOutputSpecs -->
(code lines)

    if (fs.exists(new Path(out, CrawlDatum.PARSE_DIR_NAME)))
      throw new IOException("Segment already parsed!");
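
To make the check concrete, here is a minimal standalone sketch of it as I
understand it. It uses java.nio.file rather than Hadoop's FileSystem. The
class name SegmentGuardSketch, the helper wouldReject, and the directory
name "crawl_parse" are my assumptions for illustration, not Nutch code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SegmentGuardSketch {
    // Assumed value of CrawlDatum.PARSE_DIR_NAME; used here for illustration only.
    static final String PARSE_DIR_NAME = "crawl_parse";

    // Mirrors the quoted check: refuse to run if the parse directory already exists.
    static void checkOutputSpecs(Path segment) throws IOException {
        if (Files.exists(segment.resolve(PARSE_DIR_NAME))) {
            throw new IOException("Segment already parsed!");
        }
    }

    // Convenience wrapper: true when a parse step over this segment would be rejected.
    static boolean wouldReject(Path segment) {
        try {
            checkOutputSpecs(segment);
            return false;
        } catch (IOException e) {
            return true;
        }
    }
}
```

Under this reading, any earlier step that creates the parse directory inside
the segment would cause a later, separate parse step to fail at this check.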

I tried several changes in the code to make this error go away. At first I
manually changed every place that relies on fetcher.parse=true so that it
used false instead. The error persisted. Finally, I looked at this code:


(class) FetcherOutputFormat -->
(function) getRecordWriter -->
(code lines)

    if (Fetcher.isParsing(job)) { // ASITANG: checks whether fetcher.parse is set to true
      parseOut = new ParseOutputFormat().getRecordWriter(fs, job, name, progress);
    }

If I comment out this statement in the code above:

    // parseOut = new ParseOutputFormat().getRecordWriter(fs, job, name, progress);

everything runs smoothly.
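
To show what I think my change does, here is a simplified model of the
record writer without any Hadoop classes. It is only a sketch of my current
understanding: the names Writer, ListWriter, FetchWriter and the "parse:"
value prefix are hypothetical, and the real FetcherOutputFormat may be
structured differently.

```java
import java.util.ArrayList;
import java.util.List;

class RecordWriterSketch {
    /** Hypothetical stand-in for a record writer: it receives each output record. */
    interface Writer {
        void write(String key, String value);
    }

    /** Collects records in memory so the routing can be inspected. */
    static final class ListWriter implements Writer {
        final List<String> records = new ArrayList<>();

        public void write(String key, String value) {
            records.add(key + "=" + value);
        }
    }

    /** Always keeps fetch records; keeps parse records only if a parse writer exists. */
    static final class FetchWriter implements Writer {
        final ListWriter fetchOut = new ListWriter();
        // null models fetcher.parse=false, or the parseOut assignment being commented out
        final ListWriter parseOut;

        FetchWriter(boolean parsing) {
            this.parseOut = parsing ? new ListWriter() : null;
        }

        public void write(String key, String value) {
            if (value.startsWith("parse:")) {
                if (parseOut != null) {
                    parseOut.write(key, value);
                }
                // with no parse writer, the parse record is silently dropped
            } else {
                fetchOut.write(key, value);
            }
        }
    }
}
```

In this model, removing the parse writer means parse records produced during
fetching are never written, which is what Question 4 below asks about.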


In Fetcher.java I saw the following line:

job.setOutputFormat(FetcherOutputFormat.class);


Question 1: Is it expected behavior for the crawler to stop with an
exception when fetcher.parse is enabled?
Question 2: What is the purpose of a RecordWriter?
Question 3: Apparently it is set in Fetcher.java, but how is it used?
Question 4: Does commenting out that code prevent the fetcher from writing
the parsed data to the segments? I ask because the parser in this case runs
at the end.