Posted to user@nutch.apache.org by mbehlok <m_...@hotmail.com> on 2013/02/13 20:04:25 UTC

Nutch identifier while indexing.

Hello, I am indexing 3 sites:

SiteA
SiteB
SiteC

I want to index these sites in a way that, when searching them in Solr, I can
run a search on each of these sites separately. So one could say... that's
easy, just filter them by host... WRONG... The sites are hosted on the same
host but have different starting points. That is, starting the crawl from
different root URLs (SiteA, SiteB, SiteC) produces different results. My
idea is to somehow define an identifier field in schema.xml that
tells Solr which root URL produced that crawl. Any ideas on
how to implement this? Any variations?

Mitch 
 



--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285.html
Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Nutch identifier while indexing.

Posted by mbehlok <m_...@hotmail.com>.
Thank you for your reply. Once each site is indexed in a subcollection, how will
Solr tell them apart?




RE: Nutch identifier while indexing.

Posted by Markus Jelsma <ma...@openindex.io>.
You can use the subcollection indexing filter to set a value for URLs that match a string. With it you can distinguish the sites even if they are on the same host and domain.
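
To make the idea concrete, here is a minimal Python sketch of tagging a URL by substring match. It only illustrates the concept and is not the Nutch plugin's code or configuration; the labels, the match strings, and the helper name `labels_for` are assumptions made up for this example.

```python
# Illustration of the idea only: tag a URL with every label whose
# match string occurs in it. Labels and match strings are assumptions.
SUBCOLLECTIONS = {
    "SiteA": ["/index.aspx?site=1"],
    "SiteB": ["/index.aspx?site=2"],
    "SiteC": ["/index.aspx?site=3"],
}


def labels_for(url):
    """Return the labels whose match strings appear in the URL."""
    return [
        name
        for name, needles in SUBCOLLECTIONS.items()
        if any(needle in url for needle in needles)
    ]


if __name__ == "__main__":
    print(labels_for("http://www.myDomain.com/index.aspx?site=2"))
```

Note that a plain substring match is loose: a string ending in "site=1" would also match a URL ending in "site=10", so the match strings need to be chosen with care.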
 

Re: Nutch identifier while indexing.

Posted by al...@aim.com.
The only suggestion I have is to index the site parameter at the end of the URLs as a separate field and then run faceted searches in Solr on that parameter's values.
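
As a rough sketch of that approach in Python: the first function pulls the `site` value out of a URL, and the second builds Solr query parameters that filter and facet on it. The field name `site` and both function names are assumptions for illustration; the field would have to exist in your schema.xml and be populated at indexing time.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def site_param(url):
    """Return the value of the 'site' query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("site")
    return values[0] if values else None


def solr_params(site_value, text="*:*"):
    """Build query parameters that restrict a search to one site value
    and request facet counts on the (assumed) 'site' field."""
    return urlencode({
        "q": text,
        "fq": f"site:{site_value}",
        "facet": "true",
        "facet.field": "site",
    })


if __name__ == "__main__":
    value = site_param("http://www.myDomain.com/index.aspx?site=2")
    print(value)
    print(solr_params(value))
```

The resulting string can be appended to the select URL of your Solr core. Pages that do not carry the parameter in their own URL return None here and would need to get their value some other way.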

Alex.

 

 

 


 

Re: Nutch identifier while indexing.

Posted by mbehlok <m_...@hotmail.com>.
I wish it were that simple:

SiteA = www.myDomain.com/index.aspx?site=1

SiteB = www.myDomain.com/index.aspx?site=2

SiteC = www.myDomain.com/index.aspx?site=3




Re: Nutch identifier while indexing.

Posted by al...@aim.com.
Are you saying that your sites have the form siteA.mydomain.com, siteB.mydomain.com, siteC.mydomain.com?

Alex.

 

 

 
