Posted to user@nutch.apache.org by "Nemani, Raj" <Ra...@turner.com> on 2010/08/26 20:03:32 UTC

Setting the Nutchschema field to a constant value

All,

 

I was wondering if I can force a constant value into one of the fields
defined in Nutch's schema.  Here is the scenario.

 

I have two sub-sites that I would like to crawl separately.  Something
like

 

http://parentsite.mydomain.com/site1/index.php

 

http://parentsite.mydomain.com/site12/index.php

 

 

I am sending the results of both crawls to the same Solr/Lucene index.
The index is used by a Drupal website to provide search results to the
user.

 

The user has checkboxes on the Drupal website to search for either site 1
results or site 2 results.

 

Here is the problem.  There is no way for me to differentiate between
site1 and site2 documents in the index.

 

One of the schema fields that Nutch generates for each document is called
'site'.  Ideally this would have been a good field for me to use to
differentiate between the documents in the index.  But for the sub-sites I
am crawling, the 'site' field will be set to "parentsite.mydomain.com" for
both, because both URLs have the same site value.
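
To make the problem concrete, here is a small Python sketch (illustrative only, not Nutch code) showing that the two URLs share one host while only the first path segment tells them apart. The helper name first_path_segment is made up for this example.

```python
from urllib.parse import urlparse

# The two example URLs from this post.
URLS = [
    "http://parentsite.mydomain.com/site1/index.php",
    "http://parentsite.mydomain.com/site12/index.php",
]


def first_path_segment(url: str) -> str:
    """Return the first non-empty path segment of a URL, or '' if there is none."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return parts[0] if parts else ""


# Both URLs collapse to a single host, so a host-based field cannot separate them.
hosts = {urlparse(u).hostname for u in URLS}

# The first path segment, on the other hand, differs between the two sub-sites.
segments = [first_path_segment(u) for u in URLS]

print(hosts)
print(segments)
```

Any field derived from the host alone will therefore be identical for both crawls; the distinguishing information lives in the path.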

 

That is the reason for my question.  Can I set the value of the 'site'
field to "Site1" for the site 1 crawl and "Site2" for the site 2 crawl?

 

I hope I have explained the scenario clearly.  If what I am thinking of is
not possible, can I achieve my ultimate objective in some other way?

 

Thanks so much in advance

Raj

 

 


RE: Setting the Nutchschema field to a constant value

Posted by "Nemani, Raj" <Ra...@turner.com>.
All,
I was able to find the steps to set this plugin up, so I am good there.
I do have one question.  I am running Nutch 1.1 and I believe I have set
everything up correctly.  I can see the subcollection plugin being
registered in hadoop.log, but I cannot find the "subcollection" field
in the index (viewed with Luke).

Based on some of the emails in the list archives, there are no
known problems with this plugin in 1.1.  I will include my
subcollections.xml and the plugin.includes value (from nutch-site.xml) below.
But my question is: is there any special trick to enabling logging
for plugins?  This is what I added to log4j.properties to turn on
logging for the subcollection plugin classes:

log4j.logger.org.apache.nutch.collection.CollectionManager=INFO,cmdstdout
log4j.logger.org.apache.nutch.searcher.subcollection.SubcollectionQueryFilter=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter=INFO,cmdstdout

I even tried DRFA in place of cmdstdout, hoping to see the log
statements from these classes in hadoop.log, but nothing seems to work.
Other classes set up similarly (as shown below) work fine and
produce log statements on cmdstdout.



I am a .NET developer and have used log4net, so I could be missing something
with log4j.
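
One thing that can be checked mechanically for logger lines like the ones in this message is whether every appender they name is actually defined in the same properties file. A rough Python sketch follows. It handles only the simple key=value form, the function name undefined_appenders and the sample text are made up for this example, and it says nothing about whether the plugin classes emit INFO-level messages at all.

```python
def undefined_appenders(properties_text: str) -> dict:
    """Map each logger to the appender names it references but the file never defines."""
    defined = set()
    loggers = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and anything that is not key=value.
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("log4j.appender."):
            # "log4j.appender.<name>=<class>" defines <name>; deeper keys configure it.
            defined.add(key[len("log4j.appender."):].split(".")[0])
        elif key.startswith("log4j.logger."):
            # The value has the form "<level>,<appender>,<appender>...".
            names = [v.strip() for v in value.split(",")[1:] if v.strip()]
            loggers[key[len("log4j.logger."):]] = names

    result = {}
    for logger, names in loggers.items():
        missing = [n for n in names if n not in defined]
        if missing:
            result[logger] = missing
    return result


# Hypothetical sample: the second logger deliberately misspells the appender name.
SAMPLE = """\
log4j.appender.cmdstdout=org.apache.log4j.ConsoleAppender
log4j.logger.org.apache.nutch.collection.CollectionManager=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter=INFO,cmdstdou
"""

print(undefined_appenders(SAMPLE))
```

An empty result only rules out a misspelled or missing appender; it does not show that the logger names match the classes that are actually loaded.
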
-----Original Message-----
From: Nemani, Raj [mailto:Raj.Nemani@turner.com] 
Sent: Friday, August 27, 2010 4:14 PM
To: user@nutch.apache.org
Subject: RE: Setting the Nutchschema field to a constant value

Thank you Julien.  

I was trying to find some documentation on how to set this plugin
up.  Can anybody point me to a link where the setup is documented?

I appreciate your help.

Raj


-----Original Message-----
From: Julien Nioche [mailto:lists.digitalpebble@gmail.com] 
Sent: Friday, August 27, 2010 4:42 AM
To: user@nutch.apache.org
Subject: Re: Setting the Nutchschema field to a constant value

Have a look at the subcollection plugin - I haven't used it myself but I
think it does what you need

Julien
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


RE: Setting the Nutchschema field to a constant value

Posted by "Nemani, Raj" <Ra...@turner.com>.
Thank you Julien.  

I was trying to find some documentation on how to set this plugin
up.  Can anybody point me to a link where the setup is documented?

I appreciate your help.

Raj



Re: Setting the Nutchschema field to a constant value

Posted by Julien Nioche <li...@gmail.com>.
Have a look at the subcollection plugin - I haven't used it myself but I
think it does what you need
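
For readers unfamiliar with the idea, a subcollection is essentially a named set of URL patterns, and each document is tagged with the names whose patterns match its URL. The Python sketch below only illustrates that idea. It is not the plugin's code, and the names SUBCOLLECTIONS and tag_url, the labels, and the prefix-matching rule are all assumptions made for the example. Note the trailing slash in each prefix: without it, the prefix for site1 would also match the site12 URL from the original post.

```python
# Hypothetical mapping from a label to the URL prefixes that belong to it.
SUBCOLLECTIONS = {
    "Site1": ["http://parentsite.mydomain.com/site1/"],
    "Site2": ["http://parentsite.mydomain.com/site12/"],
}


def tag_url(url: str) -> list:
    """Return the sorted labels whose prefixes match the given URL."""
    return sorted(
        name
        for name, prefixes in SUBCOLLECTIONS.items()
        if any(url.startswith(prefix) for prefix in prefixes)
    )


print(tag_url("http://parentsite.mydomain.com/site1/index.php"))
print(tag_url("http://parentsite.mydomain.com/site12/index.php"))
print(tag_url("http://parentsite.mydomain.com/other/index.php"))
```

The resulting label could then be stored in its own index field and used as a filter on the search side, leaving the 'site' field untouched.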

Julien
-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
