Posted to user@nutch.apache.org by ML mail <ml...@yahoo.com> on 2011/11/03 11:59:06 UTC

How to deal with websites without title

Hi,

I am using Nutch 1.3 with Solr 3.4 (with the Nutch schema.xml) to crawl a few websites and build a search engine for them.

I noticed that some web pages don't have a TITLE HTML element, so in the end there is no nice title to display for them in the Solr search results...

I was wondering how you deal with this case. Do you just display something like "no title" in your search results, or is there a more elegant way to handle it?

Regards,
ML

Re: parse existing segments

Posted by Ferdy Galema <fe...@kalooga.com>.
What are you trying to achieve? The crawl command does not invoke any 
plugins itself; it merely chains several Nutch jobs together. The 
Nutch jobs themselves - or more specifically their mappers and reducers - 
make use of the plugin repository.

On 11/03/2011 01:47 PM, Ashish M wrote:
> What method in crawl.java would trigger the invocation of plugins?

Re: parse existing segments

Posted by Ashish M <as...@yahoo.com>.
What method in Crawl.java would trigger the invocation of plugins? 

Sent from my iPhone. Please ignore the typos.

On Nov 3, 2011, at 5:30 AM, Markus Jelsma <ma...@openindex.io> wrote:

> remove *parse* in the segment and you're good to go.

Re: parse existing segments

Posted by Markus Jelsma <ma...@openindex.io>.
remove *parse* in the segment and you're good to go.
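In practice this means deleting the parse output directories from the segment before re-running the parse step. A minimal standalone sketch (plain Java, not Nutch code; it assumes the usual segment layout with crawl_parse, parse_data and parse_text subdirectories, and the default segment path shown is purely illustrative):

```java
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanSegment {
    // Delete a directory tree, visiting children before parents.
    static void deleteRecursively(Path dir) throws Exception {
        if (!Files.exists(dir)) return;
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws Exception {
        // Segment path from the command line; the fallback is illustrative only.
        Path segment = Paths.get(args.length > 0 ? args[0]
                                                 : "crawl/segments/20111103120000");
        // Remove only the parse outputs; the fetched content stays intact
        // so the segment can be parsed again.
        for (String dir : new String[] {"crawl_parse", "parse_data", "parse_text"}) {
            deleteRecursively(segment.resolve(dir));
        }
    }
}
```

Once these directories are gone, the output check that raises "Segment already parsed!" passes and the parse job can run again over the same fetched data.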

On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
> Hi All,
> 
> I am trying to parse already crawled segments using the method --
> ParseSegment.parse(seg);

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

parse existing segments

Posted by Ashish Mehrotra <as...@yahoo.com>.
Hi All,

I am trying to parse already crawled segments using the method --
ParseSegment.parse(seg);


seg is the Path to the existing segment.
This internally fires a new job and the error thrown is --

Exception in thread "main" java.io.IOException: Segment already parsed!
at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
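For context, the precondition behind this error can be mimicked in isolation; this is an illustrative mock of the output check, not the actual Nutch source:

```java
import java.io.IOException;
import java.nio.file.*;

public class OutputSpecCheck {
    // Mimics the check: refuse to start a parse job when the segment
    // already holds parse output (here, a crawl_parse directory).
    static void checkOutputSpecs(Path segment) throws IOException {
        if (Files.exists(segment.resolve("crawl_parse"))) {
            throw new IOException("Segment already parsed!");
        }
    }
}
```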

What I am trying to do here is parse the already fetched data to test my HTML parse filter.
It looks like this ParseSegment method gets called in the normal workflow of crawl, fetch, parse ...

What I have done is modify org.apache.nutch.crawl.Crawl.run() to call only ParseSegment, commenting out the injector, generator and fetcher parts. I am calling ParseSegment.parse(segment) in the run() method and passing the segment name on the command line.

Should I be calling some other method to test my HTML parser filter plugin without crawling again?

Any pointers would be helpful.

Thanks,
Ashish

Re: How to deal with websites without title

Posted by Markus Jelsma <ma...@openindex.io>.
If there is no title, we try to derive a proper name from the final part of the 
URL. If that yields nothing, we index no_title.
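One way to sketch that fallback (a hypothetical helper, not actual Nutch or Openindex code): take the last path segment of the URL, strip the extension, turn separators into spaces, and only fall back to "no_title" when nothing usable remains.

```java
import java.net.URI;

public class TitleFallback {
    // Return the page title if present, otherwise a name derived from the
    // last path segment of the URL, otherwise the literal "no_title".
    public static String displayTitle(String title, String url) {
        if (title != null && !title.trim().isEmpty()) {
            return title.trim();
        }
        try {
            String path = new URI(url).getPath();
            if (path != null) {
                // Drop a trailing slash, take the last segment, strip the
                // extension, and turn common separators into spaces.
                String p = path.endsWith("/")
                         ? path.substring(0, path.length() - 1) : path;
                String last = p.substring(p.lastIndexOf('/') + 1);
                int dot = last.lastIndexOf('.');
                if (dot > 0) last = last.substring(0, dot);
                last = last.replace('-', ' ').replace('_', ' ').trim();
                if (!last.isEmpty()) return last;
            }
        } catch (Exception e) {
            // Malformed URL: fall through to the default.
        }
        return "no_title";
    }
}
```

In a Nutch/Solr setup this kind of logic could live either in an indexing filter or in the search front end; doing it at index time keeps the displayed title consistent across result pages.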

On Thursday 03 November 2011 11:59:06 ML mail wrote:
> Hi,
> 
> I am using Nutch 1.3 with Solr 3.4 (using nutch schema.xml) in order to
> crawl a few websites and create a search engine for these websites. 
> 
> I noticed that some web pages don't have the TITLE HTML element and as end
> effect in Solr there are no nice title to display in the search results...

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350