Posted to user@nutch.apache.org by Glenn Barney <gb...@gmail.com> on 2007/12/08 22:50:08 UTC

adding category field based on terms

Hi All,

I've been reading and going through the nutch examples for a couple days but
haven't found an exact answer to my problem.  I want to add a category field
(with a boost score) to each document I index based on the text content of a
web page.  For example, I'm creating the category farm, and I have a set
list of keywords I want to map to the category farm (say "cow", "pig",
"farm", and "farmer").  The boost score for the new field "farm" is relative
to the frequency of these terms in my document.

The examples in this forum all describe method 1): scraping metadata from an
HTML page while parsing, and adding the category field if that metadata is
present.  This doesn't work for me because my documents have no special
metadata (I'm crawling the web), and I don't want to do anything in the
parse stage of crawling.  I want to add my new field in the index stage.  So
that leaves method 2): in the index stage, I have a reference to the document
text (via Parse.getText()) in filter() in IndexingFilter.  I can use Java's
string methods to search the text for each of my terms one by one
(and find repeats), then create a score based on frequency and add it
to a new field called "farm".  However, *this is the whole point of indexing*,
and to my understanding Lucene/Nutch is already doing this: it's already
tokenizing and already calculating term frequencies in the tokenized content
field.

As I index, I want to have Nutch do its magic, tokenizing and parsing the
content in the content field, and then go in myself and use those results to
add a new field based on these tokens.  I don't want to "index" the whole
thing twice.  I'm sure smarter people than I wrote a very effective
tokenizing implementation (say, removing punctuation and effectively finding
duplicate terms) that I want to use.

I guess if I had some magic pseudocode, I'm looking to do something like
this:
filter (
     for each word in my category
        // uses the index that's being built before this filter applies
        score += thisDocument.getFrequency(word);
     addNewField(farm, score) // set farm's boost to score
)

Is there any way (or any better way) to do what I want above?
Thanks,
-Glenn
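
For readers following along, the plain-string approach described above as method 2) can be sketched roughly as follows. This is only an illustration under stated assumptions: CategoryScorer and its methods are hypothetical names, not part of Nutch or Lucene, and the tokenization is a crude stand-in (lowercase, split on anything that is not a letter or digit) rather than the analyzer Nutch applies to the content field.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Nutch or Lucene.
class CategoryScorer {
    private final Set<String> keywords = new HashSet<>();

    CategoryScorer(String... keywords) {
        for (String keyword : keywords) {
            this.keywords.add(keyword.toLowerCase(Locale.ROOT));
        }
    }

    // Counts how many tokens of the text are category keywords.
    // Assumption: a simple regex split stands in for real analysis.
    int score(String text) {
        int score = 0;
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (keywords.contains(token)) {
                score++;
            }
        }
        return score;
    }
}
```

Inside an indexing filter, the text returned by Parse.getText() could be passed to score(), with the result used as the value or boost of the "farm" field. This does repeat tokenization work the indexer performs anyway, which is exactly the cost the question is trying to avoid.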

Custom Indexer help

Posted by ajaxtrend <te...@yahoo.com>.
Hi group,
              I want to index only certain documents, based on URL type, and reject the others. I understand that I can specify URL patterns in crawl-urlfilter.txt, but it is difficult to write a pattern that covers so many URLs, so I thought I would maintain a separate properties file for those URLs and not add a document to the index for them. In my custom filter, I added a meta tag:
   
  parse.getData().getParseMeta().set("indexit", Boolean.toString(shouldIndex));
   
  I then check the value of this meta tag in the write method of RecordWriter, but that does not seem to work.
  Any ideas? I think I have to check for this meta tag somewhere in the Indexer class, but I am not sure where. Any guidance would be great.
   
  - BR
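
The properties-file idea above might look something like the sketch below. Everything in it is an assumption made for illustration: UrlIndexPolicy is a hypothetical class, not a Nutch API, and the file layout (host plus path prefix as the key, true or false as the value) is invented. The scheme is left out of the keys because ':' is a key/value separator in the properties format and would otherwise need escaping. Where to call shouldIndex() inside Nutch is left open.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

// Hypothetical helper for illustration; not part of Nutch.
class UrlIndexPolicy {
    private final Properties rules;

    UrlIndexPolicy(Properties rules) {
        this.rules = rules;
    }

    // Parses entries such as "example.com/docs/ = true".
    // A real setup would load the same format from a file instead.
    static UrlIndexPolicy fromText(String text) {
        Properties parsed = new Properties();
        try {
            parsed.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new UrlIndexPolicy(parsed);
    }

    // Uses the flag of the longest matching prefix; no match means "do not index".
    boolean shouldIndex(String url) {
        URI uri = URI.create(url);
        if (uri.getHost() == null) {
            return false;
        }
        String target = uri.getHost() + uri.getPath();
        String best = null;
        for (String prefix : rules.stringPropertyNames()) {
            if (target.startsWith(prefix)
                    && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best != null && Boolean.parseBoolean(rules.getProperty(best).trim());
    }
}
```

Choosing the longest matching prefix lets a broad rule be overridden by a more specific one, for example allowing example.com/docs/ while rejecting example.com/docs/private/.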

       

Re: adding category field based on terms

Posted by Jasper Kamperman <ja...@openwaternet.com>.
There may be plenty of other ways, but the indices that Nutch creates
are standard Lucene indices. So after Nutch has finished creating an index,
you can use IndexReaders/Writers, which support pretty much all the
methods you use in your "magic pseudocode".



Re: adding category field based on terms

Posted by DS jha <ae...@gmail.com>.
I am not sure you will be able to do that at index time (that is,
without parsing the document text). Search engines usually maintain
an inverted index, so keywords are not stored by document; rather,
for each keyword the index maintains a list of the documents containing
that term and the corresponding position information. So I don't think the
Document/Field classes in Lucene have getTermFrequency or similar
methods.

Cheers
-Jha
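
To make the inverted-index point concrete, here is a toy version in plain Java. It is a deliberately simplified sketch: ToyInvertedIndex is a hypothetical class, positions are omitted, and Lucene's real on-disk structures are quite different. What it shows is the direction of the lookup: a per-document frequency is reached by starting from the term and then finding the document in that term's postings.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy inverted index for illustration only; not how Lucene stores data.
class ToyInvertedIndex {
    // term -> (document id -> number of occurrences in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Tokenizes the text crudely and records each term under the document id.
    void add(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Frequency of a term in one document, looked up via the term's postings.
    int frequency(String term, int docId) {
        Map<Integer, Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? 0 : docs.getOrDefault(docId, 0);
    }
}
```

With this layout, a category score for one document is a sum of frequency(word, docId) over the category's keywords, but it can only be computed once the document's terms are in the index, not from the document object being built.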

