Posted to user@nutch.apache.org by Ashish Mehrotra <as...@yahoo.com> on 2011/11/03 13:16:40 UTC

parse existing segments

Hi All,

I am trying to parse already crawled segments using the method --
ParseSegment.parse(seg);


seg is the Path to the existing segment.
This internally fires a new job and the error thrown is --

Exception in thread "main" java.io.IOException: Segment already parsed!
at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)

What I am trying to do here is parse the already fetched data to test my HTML Parse Filter.
Looks like the above method of ParseSegment gets called in the normal workflow of crawl, fetch, parse ...

What I have done is modify org.apache.nutch.crawl.Crawl.run() to call only ParseSegment, commenting out the injector, generator and fetcher parts. I am calling ParseSegment.parse(segment) in the run() method and passing the segment name on the command line.
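
Roughly, the stripped-down driver boils down to something like the sketch below (ParseOnly is just an illustrative name; in my case the code sits inside the modified Crawl.run()):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.nutch.parse.ParseSegment;
    import org.apache.nutch.util.NutchConfiguration;

    // Illustrative stand-alone equivalent of the modified Crawl.run():
    // inject/generate/fetch are skipped, only the parse job is submitted.
    public class ParseOnly {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        Path segment = new Path(args[0]);       // segment dir passed on the command line
        new ParseSegment(conf).parse(segment);  // submits the parse job for this segment
      }
    }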

Should I be calling some other method to test my HTML parser filter plugin without crawling again?

Any pointers would be helpful.

Thanks,
Ashish

Re: parse existing segments

Posted by Ferdy Galema <fe...@kalooga.com>.
What are you trying to achieve? The crawl command does not invoke any 
plugins by itself; it merely chains several Nutch jobs together. The 
Nutch jobs themselves - or more specifically their mappers and reducers - 
make use of the plugin repository.
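
If the goal is just to exercise an HtmlParseFilter against pages that were already fetched, one option is to call the parsing code directly instead of resubmitting the parse job. A rough sketch, assuming a Nutch 1.x segment layout (the FilterSmokeTest name, the content/part-00000 path and the single reduce task are assumptions, not guaranteed):

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.parse.ParseUtil;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class FilterSmokeTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // Fetched pages are stored under <segment>/content as <Text, Content>
        // records; part-00000 assumes a single fetcher reduce task.
        Path data = new Path(args[0], "content/part-00000/data");
        SequenceFile.Reader reader =
            new SequenceFile.Reader(data.getFileSystem(conf), data, conf);
        Text url = new Text();
        Content content = new Content();
        ParseUtil parseUtil = new ParseUtil(conf);  // consults the plugin repository
        while (reader.next(url, content)) {
          // Runs the registered parser plugins and HtmlParseFilters on one page.
          ParseResult result = parseUtil.parse(content);
          for (Map.Entry<Text, Parse> entry : result) {
            System.out.println(entry.getKey() + " -> "
                + entry.getValue().getData().getParseMeta());
          }
        }
        reader.close();
      }
    }

ParseUtil goes through the same parser and filter plugins the parse job uses, so whatever is activated via plugin.includes should run here as well.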

On 11/03/2011 01:47 PM, Ashish M wrote:
> What method in crawl.java would trigger the invocation of plugins?
>
> Sent from my iPhone. Please ignore the typos.
>
> On Nov 3, 2011, at 5:30 AM, Markus Jelsma<ma...@openindex.io>  wrote:
>
>> remove *parse* in the segment and you're good to go.
>>
>> On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
>>> Hi All,
>>>
>>> I am trying to parse already crawled segments using the method --
>>> ParseSegment.parse(seg);
>>>
>>>
>>> seg is the Path to the existing segment.
>>> This internally fires a new job and the error thrown is --
>>>
>>> Exception in thread "main" java.io.IOException: Segment already parsed!
>>> at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
>>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>>>
>>> What I am trying to do here is parse the already fetched data to test my
>>> HTML Parse Filter. Looks like the above method of ParseSegment gets called
>>> in the normal workflow of crawl, fetch, parse ...
>>>
>>> What I have done is modified the org.apache.nutch.crawl.Crawl.run()  to
>>> call only ParseSegment and commented the injector, generator and fetcher
>>> parts. I am calling ParseSegment.parse(segment) in the run() method. I am
>>> passing the segment name in the command line.
>>>
>>> Should I be calling some other method to test my HTML parser filter plugin
>>> without crawling again?
>>>
>>> Any pointers should be helpful.
>>>
>>> Thanks,
>>> Ashish
>> -- 
>> Markus Jelsma - CTO - Openindex
>> http://www.linkedin.com/in/markus17
>> 050-8536620 / 06-50258350

Re: parse existing segments

Posted by Ashish M <as...@yahoo.com>.
What method in crawl.java would trigger the invocation of plugins? 

Sent from my iPhone. Please ignore the typos.

On Nov 3, 2011, at 5:30 AM, Markus Jelsma <ma...@openindex.io> wrote:

> remove *parse* in the segment and you're good to go.
> 
> On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
>> Hi All,
>> 
>> I am trying to parse already crawled segments using the method --
>> ParseSegment.parse(seg);
>> 
>> 
>> seg is the Path to the existing segment.
>> This internally fires a new job and the error thrown is --
>> 
>> Exception in thread "main" java.io.IOException: Segment already parsed!
>> at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>> 
>> What I am trying to do here is parse the already fetched data to test my
>> HTML Parse Filter. Looks like the above method of ParseSegment gets called
>> in the normal workflow of crawl, fetch, parse ...
>> 
>> What I have done is modified the org.apache.nutch.crawl.Crawl.run()  to
>> call only ParseSegment and commented the injector, generator and fetcher
>> parts. I am calling ParseSegment.parse(segment) in the run() method. I am
>> passing the segment name in the command line.
>> 
>> Should I be calling some other method to test my HTML parser filter plugin
>> without crawling again?
>> 
>> Any pointers should be helpful.
>> 
>> Thanks,
>> Ashish
> 
> -- 
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

Re: parse existing segments

Posted by Markus Jelsma <ma...@openindex.io>.
remove *parse* in the segment and you're good to go.
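
In other words, delete the parse output that already sits in the segment (crawl_parse, parse_data, parse_text); the check in ParseOutputFormat.checkOutputSpecs trips as long as it is there. A minimal sketch of doing that through the Hadoop FileSystem API (ClearParseOutput is just an illustrative name; a recursive remove from the shell does the same thing):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.nutch.util.NutchConfiguration;

    public class ClearParseOutput {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        Path segment = new Path(args[0]);
        FileSystem fs = segment.getFileSystem(conf);
        // The parse step writes these three directories into the segment;
        // removing them lets ParseSegment run again on the same segment.
        for (String dir : new String[] { "crawl_parse", "parse_data", "parse_text" }) {
          fs.delete(new Path(segment, dir), true);  // recursive delete
        }
      }
    }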

On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
> Hi All,
> 
> I am trying to parse already crawled segments using the method --
> ParseSegment.parse(seg);
> 
> 
> seg is the Path to the existing segment.
> This internally fires a new job and the error thrown is --
> 
> Exception in thread "main" java.io.IOException: Segment already parsed!
> at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
> 
> What I am trying to do here is parse the already fetched data to test my
> HTML Parse Filter. Looks like the above method of ParseSegment gets called
> in the normal workflow of crawl, fetch, parse ...
> 
> What I have done is modified the org.apache.nutch.crawl.Crawl.run()  to
> call only ParseSegment and commented the injector, generator and fetcher
> parts. I am calling ParseSegment.parse(segment) in the run() method. I am
> passing the segment name in the command line.
> 
> Should I be calling some other method to test my HTML parser filter plugin
> without crawling again?
> 
> Any pointers should be helpful.
> 
> Thanks,
> Ashish

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350