Posted to user@nutch.apache.org by abhayd <aj...@hotmail.com> on 2012/01/26 23:45:26 UTC

maintain state between urls in same crawl session

hi 

I am crawling a site x.y.z that sets a cookie. When Nutch crawls another
page from the same site, it does not pass this cookie back to the server,
which results in many sessions being created.

I'm using Nutch 1.3.

Is there any setting we need to change in order to maintain state within
the same crawl?

thanks
abhay


Re: maintain state between urls in same crawl session

Posted by Markus Jelsma <ma...@openindex.io>.
> Thanks, indeed it's a pain, especially when the other crawlers we use
> have that built in.
> 
> Is this something planned for future releases?

No, but you're welcome to provide patches if possible.

I think it may be a bit easier if the work is done in the underlying
protocol library. If I'm not mistaken, it is reused across fetches as well,
so a map of domains and their obtained session cookies could be stored and
reused there.
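
For illustration only, a minimal sketch of what such a shared store could
look like (the class and method names are hypothetical, not part of the
existing Nutch protocol code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical per-domain cookie cache shared by the protocol library. */
    public class SessionCookieStore {

      // domain -> Set-Cookie value obtained from the first response
      private static final Map<String, String> COOKIES =
          new ConcurrentHashMap<String, String>();

      /** Remember the cookie a domain handed out, if we don't have one yet. */
      public static void remember(String domain, String setCookieHeader) {
        if (setCookieHeader != null) {
          COOKIES.putIfAbsent(domain, setCookieHeader);
        }
      }

      /** Cookie to send back on later requests to the same domain, or null. */
      public static String cookieFor(String domain) {
        return COOKIES.get(domain);
      }
    }

The protocol plugin would then call remember() after every response and add
a "Cookie:" request header from cookieFor() before every fetch; cookie
expiry and multiple cookies per domain are ignored here.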


Re: maintain state between urls in same crawl session

Posted by abhayd <aj...@hotmail.com>.
Thanks, indeed it's a pain, especially when the other crawlers we use
have that built in.

Is this something planned for future releases?

 


Re: maintain state between urls in same crawl session

Posted by Markus Jelsma <ma...@openindex.io>.
I don't think this can be done out of the box, since there is no state at
any time. But you should be able to hack the queue in the fetcher; that is
the only point where URLs in a given crawl share a common object in which
you can store state.

If URLs are partitioned by host or domain, they are guaranteed to end up in
the same queue object. Making this work would certainly require some serious
hacking, such as retrieving the cookie of the first session from the HTTP
client and attempting to have it reused by the following URLs from the same
queue.

This would be quite a pain to build if you're unfamiliar with the fetcher,
I guess.
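
Just to illustrate the idea (this is not working Fetcher code; the fetcher's
actual queue class and header handling are more involved), the per-queue
state could be little more than a field and two accessors:

    /**
     * Sketch of a fetch queue that also carries the session cookie obtained
     * from the first successful fetch of its host or domain.
     */
    public class CookieAwareFetchQueue {

      // e.g. "JSESSIONID=abc123"; null until the first response is seen
      private volatile String sessionCookie;

      /** Called once the first response for this queue has been parsed. */
      public void setSessionCookie(String cookie) {
        if (sessionCookie == null && cookie != null) {
          sessionCookie = cookie;
        }
      }

      /** Value to attach as a "Cookie:" header on subsequent fetches. */
      public String getSessionCookie() {
        return sessionCookie;
      }
    }

The hard part is wiring it in: the fetcher thread would have to read the
Set-Cookie header out of the protocol response and pass the stored value
back into the HTTP client for every later URL taken from the same queue.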

> hi
> 
> I am crawling a site x.y.z that sets a cookie. When Nutch crawls another
> page from the same site, it does not pass this cookie back to the server,
> which results in many sessions being created.
> 
> I'm using Nutch 1.3.
> 
> Is there any setting we need to change in order to maintain state within
> the same crawl?
> 
> thanks
> abhay