Posted to user@nutch.apache.org by ShivaKarthik S <sh...@gmail.com> on 2018/03/17 10:46:47 UTC

Is there any way to block the hubpages while crawling

Hi,

Is there any way to block hub pages and index only the articles from a
website? Hub pages should still be crawled and their outlinks extracted,
but only the articles should be indexed.
E.g.
www.abc.com/xyz & www.abc.com/abc are hub pages, and www.abc.com/xyz/1.html
& www.abc.com/abc/1.html are articles.

In this case I could block all URLs that do not end with .html, .aspx,
.jsp or some other extension. But not all websites follow the same format.
Some use .html for hub pages as well as articles, and some use no
extension for either. Given these cases, I can't write a general rule
saying that a URL without an extension is a hub page and a URL with an
extension is an article. Is there any way to handle this in Nutch 1.x?

Thanks & regards
Shiva

Re: Is there any way to block the hubpages while crawling

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.

I think you will find that you need different rules for each website and that some amount of maintenance will be needed as the websites change their practices.

Re: Is there any way to block the hubpages while crawling

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

> more control over what is being indexed?

It's possible to enable URL filters for the indexer:
   bin/nutch index ... -filter
With a little extra effort you can use different URL filter rules
during the index step, e.g. in local mode by pointing NUTCH_CONF_DIR
to a different folder.

>> I can't generalize any rule

What about classifying hubs by their number of outlinks?
Then you could skip those pages using an indexing filter: just return
null if a document should be skipped.
For a larger crawl, per-site URL filter rules will definitely become
unmanageable.
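
A minimal sketch of the outlink-count idea, written as plain Java so it
runs without Nutch on the classpath. The class name, the threshold of 40
and the Map standing in for the document are all assumptions made for
illustration. In an actual indexing-filter plugin the count would
presumably come from parse.getData().getOutlinks().length, and the filter
would return the NutchDocument or null.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: decides whether to skip a page at index time. */
class HubPageHeuristic {

    /** Assumed threshold; it has to be tuned per crawl. */
    static final int MAX_OUTLINKS_FOR_ARTICLE = 40;

    /** A page is treated as a hub when it has more outlinks than the threshold. */
    static boolean isHub(int outlinkCount, int threshold) {
        return outlinkCount > threshold;
    }

    /**
     * Mirrors the indexing-filter convention mentioned above:
     * return the document to keep it, or null to skip it.
     */
    static Map<String, String> filter(Map<String, String> doc, int outlinkCount) {
        return isHub(outlinkCount, MAX_OUTLINKS_FOR_ARTICLE) ? null : doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://www.abc.com/xyz/1.html");
        // 12 outlinks: below the threshold, so the document is kept.
        System.out.println("kept:    " + (filter(doc, 12) != null));
        // 250 outlinks: above the threshold, so the document is skipped.
        System.out.println("skipped: " + (filter(doc, 250) == null));
    }
}
```

A fixed threshold is the simplest possible classifier and will misjudge
long articles with many links, so it is only a starting point.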

Maybe you can also see this as a ranking problem: if hub pages are
only penalized rather than excluded, you could apply simple but noisy
heuristics.
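
One way to picture the ranking variant, again as a plain Java sketch
rather than Nutch code: instead of dropping a page, scale its score down
as the outlink count grows. The class name, the soft limit and the
formula are assumptions chosen for illustration only. How the factor
would be applied (for example as a document boost) depends on the
indexing backend.

```java
/** Hypothetical scoring helper, not part of Nutch. */
class HubPenalty {

    /**
     * Returns a factor in (0, 1]: 1.0 up to the soft limit, then
     * shrinking in proportion to how far the page exceeds it.
     */
    static double penalty(int outlinkCount, int softLimit) {
        if (outlinkCount <= softLimit) {
            return 1.0;
        }
        return (double) softLimit / outlinkCount;
    }

    public static void main(String[] args) {
        int softLimit = 40; // assumed value, tune per crawl
        for (int outlinks : new int[] {10, 80, 400}) {
            System.out.println(outlinks + " outlinks -> factor " + penalty(outlinks, softLimit));
        }
    }
}
```

Because nothing is excluded, a misclassified article only ranks a bit
lower, which is why a noisy heuristic is tolerable here.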

Best,
Sebastian

On 03/18/2018 10:10 AM, BlackIce wrote:
> Basically what you're saying is that you need more control over what is
> being indexed?
> 
> That's an excellent question!
> 
> Greetz!

Re: Is there any way to block the hubpages while crawling

Posted by BlackIce <bl...@gmail.com>.

Basically what you're saying is that you need more control over what is
being indexed?

That's an excellent question!

Greetz!
