Posted to user@nutch.apache.org by Martin Louis <ma...@gmail.com> on 2012/09/10 15:58:56 UTC

Help needed on Large scale single domain crawling ( Multiple country / Multilanguage / user type ) CGI urls

Hi Guys,

I am a Java engineer trying to set up an environment with all the features
of GSA and more, to address the needs of a large website.

My website:
> Works mostly on CGI commands to redirect to pages
 (like ?cmd=_services-page )
> Serves multiple countries ( the sites differ per country, because the
products we sell in each country differ ); each country is reachable by a
sub URL: *mydomain.com/<country-code>/ *.
> Supports multiple local languages for each country.
> Each country can have users with multiple types of accounts ( we
support 2-3 types of users in each country based on service level, like
"free user" / "premium user" ), and the content for them will vary.

*What will be the best approach to crawl this website for a good "site-wide
search" experience, with relevant content both for logged-in and logged-out
users?*

Here are my questions:

1. If I keep my *seed as "mydomain.com"* and initiate a crawl on the entire
site:
  >Q. How can I capture "/<country-code>/" as a field in Nutch during the
crawl?
  >Q. How can I crawl language-specific pages and index them?
            -  The same CGI command ( like ?cmd=_login-run ) is used for all
languages in a country
            -  The language flip is done by setting a cookie on the website

2. My website supports different types of accounts, and the content can be
different for each account type for the same CGI ?cmd.
     > Q. How can I group the crawl based on the account type used?

3. How can I do a POST ( form authentication )? I know I can hack the HTTP
connection, but the above grouping of the crawl based on authentication is
blocking me.



Thanks in advance for any of your valuable suggestions to my problem.

-- 
- Martin

RE: Help needed on Large scale single domain crawling ( Multiple country / Multilanguage / user type ) CGI urls

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Martin,
 
-----Original message-----
> From:Martin Louis <ma...@gmail.com>
> Sent: Wed 12-Sep-2012 11:46
> To: Markus Jelsma <ma...@openindex.io>
> Cc: user@nutch.apache.org
> Subject: Re: Help needed on Large scale single domain crawling ( Multiple country / Multilanguage / user type ) CGI urls
> 
> Thanks Markus for your answers, I will try them and post back, but one question remains in my mind:
> 
> I can hack the HTTP connection for POST authentication, but I have multiple login credentials ( user types ) for the website. What will be the approach to re-run the Nutch crawl with different login credentials? I also want to search based on user type, so that info has to be captured into a Nutch field somehow. Any suggestions?

This is tricky. Perhaps running separate crawls will do the trick, but make sure the URLs are not identical, otherwise your index will contain overwritten items. If the URLs are unique you can run one crawl and use a marker in the URL to decide how to log in.
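A minimal sketch of that marker idea (all names here are illustrative, not part of any Nutch API): append an account-type parameter to each seed URL before injecting it, so every variant gets its own key and the marker can later drive the login choice.

```java
// Sketch: make per-account-type variants of the same page unique by
// appending a marker query parameter. The parameter name "acct" is an
// assumption, not a Nutch convention; pick one your site ignores.
public class UrlMarker {
    public static String withAccountType(String url, String accountType) {
        String sep = url.contains("?") ? "&" : "?";
        return url + sep + "acct=" + accountType;
    }
}
```

A hacked protocol plugin (or a URL filter) could then read the `acct` parameter to pick credentials, and an indexing filter could copy it into a field so you can search per user type.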

> 
> > Is there a way I can capture cookie information into Nutch as a field?

Cookies are saved in the Content Metadata in the segment. You can use Nutch's parsechecker tool to see what exactly is saved. The content metadata should contain the cookie.
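For example, from a Nutch 1.x install directory (the URL here is illustrative):

```shell
# Fetch and parse a single URL, then print the parse status and the
# content/parse metadata (where any cookies should show up); -dumpText
# also prints the extracted text.
bin/nutch parsechecker -dumpText "http://mydomain.com/us/?cmd=_services-page"
```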

> 
> > Any recommendations for the CGI issue? Is there any part of the code that can be hacked to append HTTP params to the URL that Nutch stores, so that the stored URLs will be different?

I think it's best for your application to generate distinct URLs. Otherwise it may be too difficult and you may run into unexpected problems.

> OR
> Can I set up multiple Nutch instances, one for each country I support?

Yes, but again, if the URLs are not unique, the indexed URLs will be overwritten.

> OR
> Does Nutch allow some kind of grouping? ( like "Collections" and "Front ends" in GSA )

Are you talking about queries? Solr can do some kind of grouping.

> 
> 
> Thanks 
> Martin 

Re: Help needed on Large scale single domain crawling ( Multiple country / Multilanguage / user type ) CGI urls

Posted by Martin Louis <ma...@gmail.com>.
Thanks Markus for your answers, I will try them and post back, but one
question remains in my mind:

I can hack the HTTP connection for POST authentication, but I have multiple
login credentials ( user types ) for the website. What will be the approach
to re-run the Nutch crawl with different login credentials? I also want to
search based on user type, so that info has to be captured into a Nutch
field somehow. Any suggestions?

> Is there a way I can capture cookie information into Nutch as a field?

> Any recommendations for the CGI issue? Is there any part of the code that
can be hacked to append HTTP params to the URL that Nutch stores, so that
the stored URLs will be different?
OR
Can I set up multiple Nutch instances, one for each country I support?
OR
Does Nutch allow some kind of grouping? ( like "Collections" and "Front
ends" in GSA )


Thanks
Martin




-- 
- Martin

RE: Help needed on Large scale single domain crawling ( Multiple country / Multilanguage / user type ) CGI urls

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Martin, 
 
-----Original message-----
> From:Martin Louis <ma...@gmail.com>
> Sent: Mon 10-Sep-2012 16:41
> To: user@nutch.apache.org
> Subject: Help needed on Large scale single domain crawling ( Multiple country / Multilanguage / user type ) CGI urls
> 
> Hi Guys,
> 
> I am a Java engineer trying to set up an environment with all the features
> of GSA and more, to address the needs of a large website.
> 
> My website:
> > Works mostly on CGI commands to redirect to pages
>  (like ?cmd=_services-page )
> > Serves multiple countries ( the sites differ per country, because the
> products we sell in each country differ ); each country is reachable by a
> sub URL: *mydomain.com/<country-code>/ *.
> > Supports multiple local languages for each country.
> > Each country can have users with multiple types of accounts ( we
> support 2-3 types of users in each country based on service level, like
> "free user" / "premium user" ), and the content for them will vary.
> 
> *What will be the best approach to crawl this website for a good "site-wide
> search" experience, with relevant content both for logged-in and logged-out
> users?*
> 
> Here are my questions:
> 
> 1. If I keep my *seed as "mydomain.com"* and initiate a crawl on the entire
> site:
>   >Q. How can I capture "/<country-code>/" as a field in Nutch during the
> crawl?

It depends on where the country code is located. Is it an HTML element? If so, you must create a custom HTML parse filter and look for it in the DOM. Is it part of the URL? Then you can still do it with an HTML parse filter or an indexing filter, as they both have access to the URL and can look it up.
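If the country code is part of the URL, the core of such an indexing filter reduces to a regex on the URL. A minimal, self-contained sketch (class and method names are made up; a real plugin would call this from its `filter(...)` method and add the result as a field on the document):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the URL-parsing core of a hypothetical Nutch indexing filter:
// pull the country code out of URLs shaped like mydomain.com/<country-code>/...
public class CountryCodeExtractor {

    // Two-letter path segment directly after the host, e.g. /us/ or /de/
    private static final Pattern COUNTRY =
        Pattern.compile("^https?://[^/]+/([a-z]{2})(/|$)");

    public static Optional<String> extract(String url) {
        Matcher m = COUNTRY.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```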

>   >Q. How can I crawl language-specific pages and index them?
>             -  The same CGI command ( like ?cmd=_login-run ) is used for all
> languages in a country
>             -  The language flip is done by setting a cookie on the website

This is not going to work. The URL must be unique, see below.

> 
> 2. My website supports different types of accounts, and the content can be
> different for each account type for the same CGI ?cmd.
>      > Q. How can I group the crawl based on the account type used?

Very tricky. You must make sure the URLs are not identical. Different content for the same URL will not work in Nutch, because the URL is the key in all of Nutch's databases. You can get different content for the same URL by sending different HTTP headers, but in Nutch's database you will just overwrite the `other content` for that URL.

> 
> 3. How can I do a POST ( form authentication )? I know I can hack the HTTP
> connection, but the above grouping of the crawl based on authentication is
> blocking me.

Indeed, hack into the HTTP protocol plugin you're using. Nutch cannot do this by default.
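The generic half of that hack can be sketched without guessing at the plugin internals: building the `application/x-www-form-urlencoded` body for the login POST. The field names `user` and `pass` are assumptions about the site, not anything Nutch defines.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Sketch for a modified HTTP protocol plugin: encode login form fields
// into a POST body. Field names and values come from your site, not Nutch.
public class FormLogin {
    public static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (body.length() > 0) body.append('&');
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 unsupported", ex); // cannot happen
        }
        return body.toString();
    }
}
```

The body would then be written to a POST request inside your modified protocol plugin, and the session cookie from the response replayed on subsequent fetches.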

> 
> 
> 
> Thanks in advance for any of your valuable suggestions to my problem.
> 
> -- 
> - Martin
> 
