Posted to user@nutch.apache.org by Jamshaid Ashraf <ja...@gmail.com> on 2013/07/01 10:24:20 UTC

Re: Depth level 5 crawling issue

Hi,

I'm still facing the same issue; please help me out in this regard.

Regards,
Jamshaid


On Fri, Jun 28, 2013 at 4:32 PM, Jamshaid Ashraf <ja...@gmail.com> wrote:

>
> Hi,
>
> I have followed the given link and updated 'db.max.outlinks.per.page' to
> -1 in the 'nutch-default' file.
>
> but I'm still facing the same issue while crawling
> 'http://www.halliburton.com/en-US/default.page' and cnn.com; below is the
> last line of the fetcher job, which shows 0 pages fetched on the 3rd or
> 4th iteration.
>
> 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
> in 0 queues
> -activeThreads=0
> FetcherJob: done
>
> Please note that when I crawl Amazon and other sites it works fine. Do you
> think it is because of some restriction on Halliburton's side (robots.txt)
> or some misconfiguration at my end?
>
> Regards,
> Jamshaid
>
>
> On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi,
>> Can you please try this
>> http://s.apache.org/wIC
>> Thanks
>> Lewis
>>
>>
>> On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <jamshaid.qe@gmail.com
>> >wrote:
>>
>> > Hi,
>> >
>> > I'm using Nutch 2.x with HBase and tried to crawl the site
>> > "http://www.halliburton.com/en-US/default.page" at depth level 5.
>> >
>> > Following is the command:
>> >
>> > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
>> >
>> >
>> > It worked well until the 3rd iteration, but in the remaining 4th and 5th
>> > iterations nothing was fetched (the same happened with cnn.com). However,
>> > if I try to crawl other sites such as Amazon with depth level 5, it works.
>> >
>> > Could you please advise what the reasons could be for the 4th and 5th
>> > iterations failing?
>> >
>> >
>> > Regards,
>> > Jamshaid
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
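
(For reference, a minimal sketch of the 'db.max.outlinks.per.page' override
mentioned in the quoted message above. The property name and the -1 "no
limit" value come from the thread; the usual practice is to put overrides in
conf/nutch-site.xml rather than editing nutch-default.xml directly, so treat
the file placement as an assumption about your setup.)

  <?xml version="1.0"?>
  <configuration>
    <!-- Let Nutch record every outlink found on a page;
         -1 removes the default cap on outlinks per page. -->
    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
    </property>
  </configuration>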

Re: Depth level 5 crawling issue

Posted by feng lu <am...@gmail.com>.
Jamshaid

The regex-normalize.xml file can also collapse several URLs into a single
canonical one; it also cleans up some query parameters.
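
For example, a rule of roughly this shape strips a query parameter so that
several variants normalize to the same URL. (The rule and the 'sessionid'
parameter name are only an illustration, not the defaults shipped with Nutch.)

  <?xml version="1.0"?>
  <regex-normalize>
    <!-- Illustrative rule: drop a 'sessionid' query parameter so that
         /page?sessionid=123&x=2 and /page?sessionid=456&x=2 normalize alike.
         Rough sketch only; the rules shipped in conf/regex-normalize.xml
         handle edge cases such as dangling '?' or '&' more carefully. -->
    <regex>
      <pattern>([?&amp;])sessionid=[^&amp;]*&amp;?</pattern>
      <substitution>$1</substitution>
    </regex>
  </regex-normalize>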


On Mon, Jul 1, 2013 at 8:40 PM, Jamshaid Ashraf <ja...@gmail.com> wrote:

> Thanks tony!
>
> Issue with Halliburton site is resolved by changing 'regex-urlfilter' file.
> But still facing same issue with 'cnn.com'.
>
> Regards,
> Jamshaid
>
>
> On Mon, Jul 1, 2013 at 3:20 PM, Tony Mullins <tonymullins.tm@gmail.com
> >wrote:
>
> > Jamshaid ,
> >
> > I think your site urls contain query params and your regex-urlfilter.txt
> is
> > filtering them.
> > Go to your regex-urlfilter.txt and replace '-[?*!@=]' with '-[*!@]' , I
> > hope this would resolve your problem
> >
> > Tony.
> >
> >
> > On Mon, Jul 1, 2013 at 1:24 PM, Jamshaid Ashraf <jamshaid.qe@gmail.com
> > >wrote:
> >
> > > Hi,
> > >
> > > I'm still facing same issue please help me out in this regard.
> > >
> > > Regards,
> > > Jamshaid
> > >
> > >
> > > On Fri, Jun 28, 2013 at 4:32 PM, Jamshaid Ashraf <
> jamshaid.qe@gmail.com
> > > >wrote:
> > >
> > > >
> > > > Hi,
> > > >
> > > > I have followed the given link and updated 'db.max.outlinks.per.page'
> > to
> > > > -1 in 'nutch-default' file.
> > > >
> > > > but facing same issue while crawling '
> > > > http://www.halliburton.com/en-US/default.page & cnn.com', below is
> the
> > > > last line of fetcher job which shows 0 page found on 3rd or 4th
> > > iteration.
> > > >
> > > > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
> > > URLs
> > > > in 0 queues
> > > > -activeThreads=0
> > > > FetcherJob: done
> > > >
> > > > Please note that when I crawl amazon & others sites it works fine. Do
> > you
> > > > think is it because of some restriction of halliborton (robot.txt) or
> > > some
> > > > misconfiguration at my end?
> > > >
> > > > Regards,
> > > > Jamshaid
> > > >
> > > >
> > > > On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <
> > > > lewis.mcgibbney@gmail.com> wrote:
> > > >
> > > >> Hi,
> > > >> Can you please try this
> > > >> http://s.apache.org/wIC
> > > >> Thanks
> > > >> Lewis
> > > >>
> > > >>
> > > >> On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <
> > jamshaid.qe@gmail.com
> > > >> >wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > I'm using nutch 2.x with HBase and tried to crawl "
> > > >> > http://www.halliburton.com/en-US/default.page" site for depth
> level
> > > 5.
> > > >> >
> > > >> > Following is the command:
> > > >> >
> > > >> > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
> > > >> >
> > > >> >
> > > >> > It worked well till 3rd iteration but for remaining 4th and 5th
> > > nothing
> > > >> > fetched (same case happened with cnn.com). but if i tried to
> crawl
> > > >> other
> > > >> > sites like amazon with depth level 5 it works.
> > > >> >
> > > >> > Could you please guide what could be the reasons for failing of
> 4th
> > > and
> > > >> 5th
> > > >> > iteration.
> > > >> >
> > > >> >
> > > >> > Regards,
> > > >> > Jamshaid
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> *Lewis*
> > > >>
> > > >
> > > >
> > >
> >
>



-- 
Don't Grow Old, Grow Up... :-)

Re: Depth level 5 crawling issue

Posted by Jamshaid Ashraf <ja...@gmail.com>.
Thanks, Tony!

The issue with the Halliburton site is resolved by changing the
'regex-urlfilter' file, but I'm still facing the same issue with cnn.com.
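
(One way to narrow down why the cnn.com links are dropped, assuming your
build ships the org.apache.nutch.net.URLFilterChecker utility and that
bin/nutch falls through to running an arbitrary class name; both are
assumptions worth verifying against your checkout. The tool reads URLs from
stdin and prints each one back prefixed with '+' if the configured filters
accept it or '-' if some filter rejects it. The sample URL below is made up
for illustration.)

  # Pipe a candidate URL through every configured URL filter.
  echo "http://edition.cnn.com/2013/07/01/world/sample-article/index.html" | \
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
  # Expected output: the URL prefixed with '+' (accepted) or '-' (rejected).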

Regards,
Jamshaid


On Mon, Jul 1, 2013 at 3:20 PM, Tony Mullins <to...@gmail.com> wrote:

> Jamshaid ,
>
> I think your site urls contain query params and your regex-urlfilter.txt is
> filtering them.
> Go to your regex-urlfilter.txt and replace '-[?*!@=]' with '-[*!@]' , I
> hope this would resolve your problem
>
> Tony.
>
>
> On Mon, Jul 1, 2013 at 1:24 PM, Jamshaid Ashraf <jamshaid.qe@gmail.com
> >wrote:
>
> > Hi,
> >
> > I'm still facing same issue please help me out in this regard.
> >
> > Regards,
> > Jamshaid
> >
> >
> > On Fri, Jun 28, 2013 at 4:32 PM, Jamshaid Ashraf <jamshaid.qe@gmail.com
> > >wrote:
> >
> > >
> > > Hi,
> > >
> > > I have followed the given link and updated 'db.max.outlinks.per.page'
> to
> > > -1 in 'nutch-default' file.
> > >
> > > but facing same issue while crawling '
> > > http://www.halliburton.com/en-US/default.page & cnn.com', below is the
> > > last line of fetcher job which shows 0 page found on 3rd or 4th
> > iteration.
> > >
> > > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
> > URLs
> > > in 0 queues
> > > -activeThreads=0
> > > FetcherJob: done
> > >
> > > Please note that when I crawl amazon & others sites it works fine. Do
> you
> > > think is it because of some restriction of halliborton (robot.txt) or
> > some
> > > misconfiguration at my end?
> > >
> > > Regards,
> > > Jamshaid
> > >
> > >
> > > On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <
> > > lewis.mcgibbney@gmail.com> wrote:
> > >
> > >> Hi,
> > >> Can you please try this
> > >> http://s.apache.org/wIC
> > >> Thanks
> > >> Lewis
> > >>
> > >>
> > >> On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <
> jamshaid.qe@gmail.com
> > >> >wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I'm using nutch 2.x with HBase and tried to crawl "
> > >> > http://www.halliburton.com/en-US/default.page" site for depth level
> > 5.
> > >> >
> > >> > Following is the command:
> > >> >
> > >> > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
> > >> >
> > >> >
> > >> > It worked well till 3rd iteration but for remaining 4th and 5th
> > nothing
> > >> > fetched (same case happened with cnn.com). but if i tried to crawl
> > >> other
> > >> > sites like amazon with depth level 5 it works.
> > >> >
> > >> > Could you please guide what could be the reasons for failing of 4th
> > and
> > >> 5th
> > >> > iteration.
> > >> >
> > >> >
> > >> > Regards,
> > >> > Jamshaid
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> *Lewis*
> > >>
> > >
> > >
> >
>

Re: Depth level 5 crawling issue

Posted by Tony Mullins <to...@gmail.com>.
Jamshaid,

I think your site's URLs contain query parameters and your
regex-urlfilter.txt is filtering them out.
Go to your regex-urlfilter.txt and replace '-[?*!@=]' with '-[*!@]'; I hope
this resolves your problem.
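
For clarity, the line in question is the stock "skip URLs containing certain
characters as probable queries" rule. Assuming an otherwise default
regex-urlfilter.txt, the change looks roughly like this:

  # Before (default): reject any URL containing ?, *, !, @ or =,
  # which silently drops every query-string URL.
  # -[?*!@=]
  #
  # After: still reject *, ! and @, but let '?' and '=' through so that
  # URLs such as .../default.page?param=value can be fetched.
  -[*!@]

Note that loosening this rule can pull in many near-duplicate URLs, which is
where the normalization rules mentioned elsewhere in this thread help.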

Tony.


On Mon, Jul 1, 2013 at 1:24 PM, Jamshaid Ashraf <ja...@gmail.com> wrote:

> Hi,
>
> I'm still facing same issue please help me out in this regard.
>
> Regards,
> Jamshaid
>
>
> On Fri, Jun 28, 2013 at 4:32 PM, Jamshaid Ashraf <jamshaid.qe@gmail.com
> >wrote:
>
> >
> > Hi,
> >
> > I have followed the given link and updated 'db.max.outlinks.per.page' to
> > -1 in 'nutch-default' file.
> >
> > but facing same issue while crawling '
> > http://www.halliburton.com/en-US/default.page & cnn.com', below is the
> > last line of fetcher job which shows 0 page found on 3rd or 4th
> iteration.
> >
> > 0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0
> URLs
> > in 0 queues
> > -activeThreads=0
> > FetcherJob: done
> >
> > Please note that when I crawl amazon & others sites it works fine. Do you
> > think is it because of some restriction of halliborton (robot.txt) or
> some
> > misconfiguration at my end?
> >
> > Regards,
> > Jamshaid
> >
> >
> > On Fri, Jun 28, 2013 at 12:37 AM, Lewis John Mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> >> Hi,
> >> Can you please try this
> >> http://s.apache.org/wIC
> >> Thanks
> >> Lewis
> >>
> >>
> >> On Thu, Jun 27, 2013 at 8:01 AM, Jamshaid Ashraf <jamshaid.qe@gmail.com
> >> >wrote:
> >>
> >> > Hi,
> >> >
> >> > I'm using nutch 2.x with HBase and tried to crawl "
> >> > http://www.halliburton.com/en-US/default.page" site for depth level
> 5.
> >> >
> >> > Following is the command:
> >> >
> >> > bin/crawl urls/seed.txt HB http://localhost:8080/solr/ 5
> >> >
> >> >
> >> > It worked well till 3rd iteration but for remaining 4th and 5th
> nothing
> >> > fetched (same case happened with cnn.com). but if i tried to crawl
> >> other
> >> > sites like amazon with depth level 5 it works.
> >> >
> >> > Could you please guide what could be the reasons for failing of 4th
> and
> >> 5th
> >> > iteration.
> >> >
> >> >
> >> > Regards,
> >> > Jamshaid
> >> >
> >>
> >>
> >>
> >> --
> >> *Lewis*
> >>
> >
> >
>