Posted to user@nutch.apache.org by Mohamed Parvez <pa...@gmail.com> on 2009/09/08 23:02:54 UTC

How to crawl pagination in sequence

I have a set of paginated pages that only work if they are crawled in a
given sequence and in the same session.

For example, the first URLs are:

http://www.myhost.com/?page_number=1
http://www.myhost.com/?page_number=2
http://www.myhost.com/?page_number=3

The first page has a link to the second page.
The second page has links to the first and third pages.
The third page has links to the second and fourth pages.
And so on...

Nutch is able to crawl the first 6 pages, but beyond that it either cannot
crawl or gets an empty result.

If I manually click through the pagination in a browser, I can reach the
end with no problem.

Is the Nutch crawl session timing out? How do we increase the timeout?

I tried crawling with one thread, but got the same result.

Any suggestions?

---
Thanks/Regards,
Parvez

Re: How to crawl pagination in sequence

Posted by fa...@butterflycluster.net.
I don't know; look around the httpclient code.

But you probably want to make sure it's a client session issue first.

I could be wrong.
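
One way to check whether this is a client session issue, outside Nutch, is to
fetch the pages in order with a client that keeps cookies, then request a
later page cold with a client that does not. The following is only a rough
sketch using the Python standard library; the helper names are made up for
illustration, and the page_number URLs are the ones from the original post.
If the cold request comes back empty while the in-sequence one does not, the
site is probably tying pagination to a session cookie.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Build an opener that stores cookies and resends them, like a browser session."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def page_urls(base, first, last):
    """URLs for pages first..last, following the ?page_number=N pattern from the post."""
    return [f"{base}?page_number={n}" for n in range(first, last + 1)]


def fetch_lengths(opener, urls, timeout=30):
    """Fetch the URLs in order with one opener and return (url, body length) pairs."""
    results = []
    for url in urls:
        with opener.open(url, timeout=timeout) as response:
            results.append((url, len(response.read())))
    return results


def compare(base, last_page, cold_page):
    """Crawl pages 1..last_page in one session, then request cold_page with no cookies."""
    session, jar = make_session_opener()
    in_sequence = fetch_lengths(session, page_urls(base, 1, last_page))
    cookie_names = [cookie.name for cookie in jar]
    stateless = urllib.request.build_opener()  # default opener: no cookie handling
    cold = fetch_lengths(stateless, page_urls(base, cold_page, cold_page))
    return in_sequence, cookie_names, cold


# Example call (needs network access to the real host):
# print(compare("http://www.myhost.com/", 40, 7))
```

The cookie names returned by compare() would also show which cookie, if any,
the site uses to track the session.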

> I tried running with one thread, still same results.
> Any hint on how do we make Nutch aware of session cookies



Re: How to crawl pagination in sequence

Posted by Mohamed Parvez <pa...@gmail.com>.
I tried running with one thread, still the same results.
Any hint on how to make Nutch aware of session cookies?

---
Thanks/Regards,
Parvez



Re: How to crawl pagination in sequence

Posted by fa...@butterflycluster.net.
How many threads are you running?

Nutch doesn't know about sessions.

You might have to do something like fetching with one thread at a time, but
that's slow.

Or maybe make Nutch aware of session cookies.
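
For context on what being aware of session cookies means: the server hands
out an ID in a Set-Cookie response header, and the client has to send it back
in a Cookie header on every later request. A fetcher that drops that header
looks like a brand-new visitor each time. Below is a small Python sketch of
that round trip. The JSESSIONID name and value are made-up examples, and this
is not Nutch code.

```python
from http.cookies import SimpleCookie


def cookie_header(set_cookie_values):
    """Turn Set-Cookie header values from a response into a Cookie request header value."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)  # parses name=value plus attributes such as Path
    # Only name=value pairs go back to the server; attributes stay on the client.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical Set-Cookie header from the response to page 1:
first_response = ["JSESSIONID=abc123; Path=/"]

# Value to send as the Cookie header when requesting page 2, 3, ...
print(cookie_header(first_response))
```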




Re: How to crawl pagination in sequence

Posted by Mohamed Parvez <pa...@gmail.com>.
I am crawling at depth 40, as there are 40 pages in the pagination.

It works fine for the first 6 pages. After that it goes to the 7th page,
but it looks like that is a different session, and hence the pagination
won't work.

I mean, if you hit page 7 directly using the URL, the pagination won't
work and will return an empty set.

But if you go through the pages in sequence in the same session, the
pagination works.
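
The reasoning behind depth 40 can be sketched in Python: if each page links
only to its neighbours, every crawl round can discover just one new page, so
page N is first fetched in round N. This assumes that depth counts fetch
rounds starting from a seed of page 1; the neighbour-only link structure is
an assumption about the site, and none of this is checked against Nutch
itself.

```python
def rounds_to_reach(links, seed):
    """Breadth-first crawl: return {page: round in which it is first fetched}."""
    fetched = {seed: 1}
    frontier = [seed]
    round_number = 1
    while frontier:
        round_number += 1
        next_frontier = []
        for page in frontier:
            for target in links.get(page, []):
                if target not in fetched:
                    fetched[target] = round_number
                    next_frontier.append(target)
        frontier = next_frontier
    return fetched


# Assumed link structure: page n links to pages n-1 and n+1, for 40 pages.
TOTAL_PAGES = 40
links = {
    n: [m for m in (n - 1, n + 1) if 1 <= m <= TOTAL_PAGES]
    for n in range(1, TOTAL_PAGES + 1)
}
rounds = rounds_to_reach(links, seed=1)
print(rounds[7], rounds[40])
```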


---
Thanks/Regards,
Parvez



Re: How to crawl pagination in sequence

Posted by fa...@butterflycluster.net.
Could be tricky from what I've seen.

There are limits on how many times you can hit one host/IP.

Also, the depth you are crawling at may come into play (which
is probably what you want to look at in this case).





Re: How to crawl pagination in sequence

Posted by Mohamed Parvez <pa...@gmail.com>.
Any hint on how to increase the session time of the Nutch crawl thread?
I tried crawling with one thread, still no luck.

----
Thanks/Regards,
Parvez


