Posted to user@nutch.apache.org by "Wilkerson, Cory" <cw...@cars.com> on 2005/08/09 17:51:18 UTC

Cookies, etc.

Good morning everyone,

I've been spending a bit of time with Nutch lately - it's looking like a
really solid product - but I've a couple of questions that I need to
resolve before I can really say whether or not Nutch will work for my
particular situation.

<Note - the rest of this email presumes some server-side knowledge.>

I've pointed Nutch at a relatively small J2EE-based intranet and am
currently performing an intranet crawl, per the Nutch tutorial.  As per
the J2EE spec, the presentation tier uses the jsessionid token to
maintain client state.  Right now, my pages behave as expected for
clients without cookies (such as Nutch) and append the jsessionid to
each generated link (foo.jsp;jsessionid=XXXXXXXXX).  While this works,
the URLs that Nutch stores in the index contain the jsessionid token,
which is a bit confusing and unnecessary.

What I'd like to see is Nutch obey the standard cookie model for any
cookies returned by the requested domain.  I realize this probably
doesn't scale well for whole-web crawls, but it does make a lot of sense
for small intranet crawls.

Am I missing some command-line argument that will tell Nutch to play
well with cookies?

Thanks for any assistance you can offer,
Cory Wilkerson

Re: Cookies, etc.

Posted by Andrzej Bialecki <ab...@getopt.org>.
Wilkerson, Cory wrote:
> Good morning everyone,
> 
> I've been spending a bit of time with Nutch lately - it's looking like a
> really solid product - but I've a couple of questions that I need to
> resolve before I can really say whether or not Nutch will work for my
> particular situation.
> 
> <Note - the rest of this email presumes some server-side knowledge.>
> 
> I've pointed Nutch at a relatively small J2EE-based intranet and am
> currently performing an intranet crawl, per the Nutch tutorial.  As per
> the J2EE spec, the presentation tier utilizes the jsessionid token to
> maintain client state.  Right now, I'm seeing my pages perform
> accordingly to non-cookied clients (Nutch) and serialize the jsessionid
                  ^^^^^^^^^^^^^^^^^^^
Recent development versions of Nutch use the protocol-httpclient plugin
to handle HTTP, and this plugin supports cookies. Which version are you
using?

> onto the generated link (foo.jsp;jsessionid=XXXXXXXXX), and while this
> works, the urls that Nutch stores in the index contain the jsessionid
> token (yes, it works, but it's a bit confusing and unnecessary).  

This can be removed through a regular expression in 
conf/regex-normalizer.xml
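
As a rough illustration only, here is the kind of pattern such a rule
might use, tried out in plain Java. This is a sketch, not Nutch code:
the class and method names are made up for the example, and the pattern
assumes the token always appears as ";jsessionid=..." before any "?" or
"#" in the URL.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: shows a regular expression
// that strips a ";jsessionid=..." path parameter from a URL.
class JsessionidStripper {

    // Matches ";jsessionid=" (case-insensitive) and everything after it
    // up to, but not including, the query string or fragment.
    private static final Pattern SESSION_ID =
            Pattern.compile("(?i);jsessionid=[^?#]*");

    // Returns the URL with any jsessionid path parameter removed.
    static String strip(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }
}
```

With this, JsessionidStripper.strip("http://intranet.example/foo.jsp;jsessionid=ABC123?x=1")
gives "http://intranet.example/foo.jsp?x=1", and a URL without the token
comes back unchanged. The same pattern, paired with an empty
substitution, is what I would try in the normalizer configuration.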

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com