Posted to user@nutch.apache.org by "K.A.Hussain Ali" <Hu...@photoninfotech.com> on 2005/12/08 15:31:56 UTC

Crawling listing (pagination) pages.

Hi all,

  Does Nutch crawl the pages behind listing pages (pages with pagination, as in search engines)?

    While crawling with Nutch, I need to fetch the pages reached through the pagination links without increasing the depth of the whole crawl.
    Does Nutch provide any plugin for this?
    Is there any way to solve this issue?

Any help is greatly appreciated.
Thanks in advance,
Regards
-Hussain

Re: Crawling listing (pagination) pages.

Posted by "K.A.Hussain Ali" <Hu...@photoninfotech.com>.
Hi Jack,

The approach you mentioned is one way to solve the problem,
but should the URLs be checked against a regular expression during
crawling, and is that possible?
Or should that be done while indexing?

Any help is appreciated.
Thanks in advance,
Regards
----- Original Message ----- 
From: "Jack Tang" <hi...@gmail.com>
To: <nu...@lucene.apache.org>; "K.A.Hussain Ali" <Hu...@photoninfotech.com>
Sent: Thursday, December 08, 2005 8:05 PM
Subject: Re: Crawling listing (pagination) pages.


Hi

I am facing the same problem. However, my crawl only focuses on some
websites, and I recognize the pagination URLs using a regexp and inject
them in every fetch cycle.

/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Crawling listing (pagination) pages.

Posted by Jack Tang <hi...@gmail.com>.
Hi

I am facing the same problem. However, my crawl only focuses on some
websites, and I recognize the pagination URLs using a regexp and inject
them in every fetch cycle.

/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
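
The approach Jack describes (recognizing pagination URLs with a regexp so they can be injected again in the next fetch cycle) might look roughly like the sketch below. It is not Nutch code and does not use any Nutch API: the pattern, the function name pagination_urls and the example.com URLs are all assumptions made for illustration, and a real site would need its own pattern.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern (an assumption, not taken from the thread): matches
# listing URLs such as http://example.com/list?page=3 or
# http://example.com/list/page/3
PAGINATION_RE = re.compile(r"(?:[?&]page=\d+|/page/\d+/?$)")


def pagination_urls(base_url, outlinks):
    """Return the outlinks that look like pagination pages.

    Links are resolved against base_url, kept in their original order,
    and duplicates are dropped.
    """
    seen = set()
    result = []
    for link in outlinks:
        absolute = urljoin(base_url, link)
        if PAGINATION_RE.search(absolute) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    # Outlinks as they might be extracted from one fetched listing page.
    links = ["/list?page=2", "/about", "list/page/3", "/list?page=2"]
    for url in pagination_urls("http://example.com/", links):
        print(url)
```

The URLs this returns could be written to a seed list and injected before the next fetch cycle, so the listing pages are followed without raising the depth of the whole crawl.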