Posted to user@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2007/02/17 18:48:41 UTC

focused crawls -- where to add parse filter

In doing whole-internet focused crawls we'd like a parse/injector
filter.

Say we only want pages in our Nutch db and index that have the word
"nutch" in them. I'd like to express the rule as a Lucene boolean
query, contents:nutch, because in our real-world scenario the match
is fuzzier and involves many phrases and terms. It's not just a
regular expression.

If the query does not match or matches under a threshold score, I  
don't want to add the fetched/parsed document to the index, nor (more  
importantly) have the generator find outlinks from that page for  
future crawls.

This is somewhat like a url filter, but instead of filtering by url  
content I want to filter by parsed page content.

Where would I add this in nutch?

-Brian






Re: focused crawls -- where to add parse filter

Posted by Doğacan Güney <do...@gmail.com>.
On 2/19/07, Dennis Kubes <nu...@dragonflymc.com> wrote:

[snip]

>
> You could drop the HtmlParseFilter part and simply write the post
> crawl/index MR job after to update the CrawlDatum based on your lucene
> queries.  You would still need to write the second part that does the
> generation based on a different sort value.

The second part can be written with a different scoring plugin. Simply
put whatever it is you need in CrawlDatum's metadata, then change
ScoringFilter.generatorSortValue to look up that value and give a
good/bad score.
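For illustration, the generate-time half of that idea might look roughly like this. This is a self-contained sketch, not the real ScoringFilter interface: the class name, the `"topical"` metadata key, and the method signature are hypothetical stand-ins for the actual Nutch plumbing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the generatorSortValue idea: pages whose
// CrawlDatum metadata carries a "topical" flag keep their sort score;
// others are demoted so the Generator sorts them to the bottom.
public class TopicalSortSketch {

    // Demotion factor for pages that did not match the topic query.
    static final float OFF_TOPIC_PENALTY = 0.001f;

    static float generatorSortValue(Map<String, String> datumMetadata,
                                    float initSort) {
        if ("true".equals(datumMetadata.get("topical"))) {
            return initSort;                 // on-topic: generate as usual
        }
        return initSort * OFF_TOPIC_PENALTY; // off-topic: effectively skip
    }

    public static void main(String[] args) {
        Map<String, String> onTopic = new HashMap<>();
        onTopic.put("topical", "true");
        Map<String, String> offTopic = new HashMap<>();

        System.out.println(generatorSortValue(onTopic, 1.0f));
        System.out.println(generatorSortValue(offTopic, 1.0f));
    }
}
```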

[snip]

-- 
Doğacan Güney

Re: focused crawls -- where to add parse filter

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Brian Whitman wrote:
> 
>> How about an outlink filter that works during parse? In 
>> ParseOutputFormat,
>> it will take the parse text, parse data (etc.) of the source page and
>> the destination url then will either return "filter this outlink" or
>> "let it through".
> 
>> Write an HtmlParseFilter that sets an attribute in the ParseData 
>> MetaData based on whether the page contains what you are looking for. 
>> Then write another MR job that runs after the crawl/index cycle.  This 
>> job would need to update the CrawlDatum MetaData based on your 
>> priority calculation (inlinks and contains text, etc.).  Then hack the 
>> Generator class around line 160 to change the sort value that it is 
>> using based on the CrawlDatum MetaData.  I would make using this new 
>> sort value an option that you can turn on and off by using different 
>> configuration values.
> 
> Hi Doğacan, Dennis:
> 
> Thanks for the ideas. I spent some time mentally planning out how to 
> implement both of these ideas by looking at the source. I'm still newish 
> to Nutch so excuse my naiveté.
> 
> Do either of these approaches let me get at the analyzed/indexed 
> contents of the page text so that I can perform Lucene queries for 
> filtering? What I could tell of the HtmlParseFilter or Parse in general 
> is that it gets me at the parse tree, which I could do regexp queries on 
> -- but I'd rather it be all in Lucene and be influenced by the relative 
> ranking of terms amongst all documents. I am envisioning machine 
> generated queries from our classifiers that might be hundreds of tokens 
> long with boost values per term, and a score threshold. So I'd need to 
> act on the documents post-index. Unless I'm reading your suggestions 
> incorrectly, neither of them let me at that?

You could drop the HtmlParseFilter part and simply write the post-crawl/index 
MR job to update the CrawlDatum based on your Lucene 
queries.  You would still need to write the second part that does the 
generation based on a different sort value.
> 
> I am currently looking at PruneIndexTool -- could a modification of this 
> work? I could run it after a crawl/index cycle but before invertlinks 
> and the next generate. The one issue I see is that PruneIndexTool claims 
> not to affect the WebDB. Does this mean that even though the lucene doc 
> will be gone, the link and outlinks will remain in the WebDB and will be 
> fetched anyway?

That is correct.  You will need to alter the CrawlDb to affect what is 
generated and hence fetched.
> 
> If I should instead be looking harder at your recommended 
> HtmlParseFilter or ParseOutputFormat, please correct me.

No. If you are doing complex queries rather than something like "if this 
page contains words x, y, and z", then I wouldn't do it through the 
HtmlParseFilter; I would probably go with the Lucene after-index approach.
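The after-index flow might be sketched roughly as below. This is self-contained and hypothetical: the Lucene search itself is abstracted into a per-URL score map, and the class name, threshold, and `"topical"` flag are made up for illustration, not actual Nutch or Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the post-index step: take per-URL scores from a
// Lucene query run over the fresh index, then compute a flag for each
// CrawlDb entry so that entries below the threshold can be marked and the
// Generator can skip their outlinks.
public class PostIndexFilterSketch {

    static final float SCORE_THRESHOLD = 0.5f;

    static Map<String, String> markOffTopic(Map<String, Float> luceneScores) {
        Map<String, String> crawlDbFlags = new HashMap<>();
        for (Map.Entry<String, Float> e : luceneScores.entrySet()) {
            boolean topical = e.getValue() >= SCORE_THRESHOLD;
            crawlDbFlags.put(e.getKey(), topical ? "true" : "false");
        }
        return crawlDbFlags;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new HashMap<>();
        scores.put("http://lucene.apache.org/nutch/", 0.9f);
        scores.put("http://example.com/off-topic", 0.1f);
        System.out.println(markOffTopic(scores));
    }
}
```

In a real MR job, the flag map would be joined against the CrawlDb and written into each CrawlDatum's metadata.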

Dennis Kubes
> 
> -Brian
> 

Re: focused crawls -- where to add parse filter

Posted by Brian Whitman <br...@variogr.am>.
> How about an outlink filter that works during parse? In  
> ParseOutputFormat,
> it will take the parse text, parse data (etc.) of the source page and
> the destination url then will either return "filter this outlink" or
> "let it through".

> Write an HtmlParseFilter that sets an attribute in the ParseData  
> MetaData based on whether the page contains what you are looking  
> for. Then write another MR job that runs after the crawl/index  
> cycle.  This job would need to update the CrawlDatum MetaData based  
> on your priority calculation (inlinks and contains text, etc.).   
> Then hack the Generator class around line 160 to change the sort  
> value that it is using based on the CrawlDatum MetaData.  I would  
> make using this new sort value an option that you can turn on and  
> off by using different configuration values.

Hi Doğacan, Dennis:

Thanks for the ideas. I spent some time mentally planning out how to  
implement both of these ideas by looking at the source. I'm still  
newish to Nutch so excuse my naiveté.

Do either of these approaches let me get at the analyzed/indexed  
contents of the page text so that I can perform Lucene queries for  
filtering? What I could tell of the HtmlParseFilter or Parse in  
general is that it gets me at the parse tree, which I could do regexp 
queries on -- but I'd rather it be all in Lucene and be influenced by  
the relative ranking of terms amongst all documents. I am envisioning  
machine generated queries from our classifiers that might be hundreds  
of tokens long with boost values per term, and a score threshold. So  
I'd need to act on the documents post-index. Unless I'm reading your  
suggestions incorrectly, neither of them let me at that?


I am currently looking at PruneIndexTool -- could a modification of  
this work? I could run it after a crawl/index cycle but before  
invertlinks and the next generate. The one issue I see is that  
PruneIndexTool claims not to affect the WebDB. Does this mean that  
even though the lucene doc will be gone, the link and outlinks will  
remain in the WebDB and will be fetched anyway?

If I should instead be looking harder at your recommended  
HtmlParseFilter or ParseOutputFormat, please correct me.

-Brian


Re: focused crawls -- where to add parse filter

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 2/17/07, Brian Whitman <br...@variogr.am> wrote:
>
> I'm not worried about a hack; our whole setup is very "der lauf der
> dinge" and one more plank won't matter much :) But after sending my
> question out, I realized that I would need to index the document
> anyway before being able to lucene query it for topicality. I don't
> mind having pages stored that don't match my query, but I really
> would rather the generator not get more outlinks from those pages.

How about an outlink filter that works during parse? In ParseOutputFormat,
it will take the parse text, parse data (etc.) of the source page and
the destination URL, then will either return "filter this outlink" or
"let it through".
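As a rough sketch of that idea (self-contained; the class name, threshold, and method are hypothetical illustrations, not the actual ParseOutputFormat hook):

```java
// Hypothetical outlink filter: given a topical score computed for the
// source page at parse time, decide whether an outlink should be written
// to the crawl db or dropped.
public class OutlinkFilterSketch {

    static final float SCORE_THRESHOLD = 0.5f;

    /** Returns true if the outlink should be kept for future generation. */
    static boolean keepOutlink(float sourcePageScore, String toUrl) {
        // Drop every outlink of an off-topic source page; a real filter
        // could also inspect the destination URL itself.
        return sourcePageScore >= SCORE_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(keepOutlink(0.9f, "http://example.com/a")); // true
        System.out.println(keepOutlink(0.1f, "http://example.com/b")); // false
    }
}
```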

>
> So a simple fix would be something I can write or run after a
> crawl/index cycle that can mark certain pages to not emit more URIs in the
> generator. It would query each page in an index and update some flag.
> But what is that flag and how can I get at it?
>
> And more advanced and later on -- the generator has smarts to
> prioritize fetching by inlink counts-- is there something I can hack
> to "boost" outlink fetches based on the source page's content? For
> example, I find a page that scores high on my Lucene query after
> crawl/index gets done. I would want the generator to put all of its
> outlinks up top, even if there's not many inlinks to that page...
> would this be a "generator plugin?"

You should be able to do this with a scoring plugin and a parse plugin.

Write a parse plugin (or update a current one) to analyze the content
and put the result in the parse data's metadata (for example, put a
<"boost", "10"> pair in it). Then, in
<your_scoring_filter>.distributeScoreToOutlink, check whether the parse
data's metadata has the "boost" field and boost the outlink accordingly.
You may also want to consider changing the indexerScore method to give
it an even higher boost.
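The score-distribution half of this suggestion could be sketched like so. This is a self-contained illustration, assuming a simple even split of the page score across outlinks; the class name and method signature are hypothetical, not the real ScoringFilter API.

```java
import java.util.Map;

// Hypothetical stand-in for the distributeScoreToOutlink idea: if the
// source page's parse metadata carries a "boost" value, each outlink's
// share of the page score is multiplied by it before generation.
public class OutlinkBoostSketch {

    static float outlinkScore(float pageScore, int outlinkCount,
                              Map<String, String> parseMetadata) {
        float share = pageScore / outlinkCount;   // default even split
        String boost = parseMetadata.get("boost");
        if (boost != null) {
            share *= Float.parseFloat(boost);     // e.g. <"boost", "10">
        }
        return share;
    }

    public static void main(String[] args) {
        System.out.println(outlinkScore(1.0f, 10, Map.of("boost", "10")));
        System.out.println(outlinkScore(1.0f, 10, Map.of()));
    }
}
```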

>
> -Brian
>


-- 
Doğacan Güney

Re: focused crawls -- where to add parse filter

Posted by Dennis Kubes <nu...@dragonflymc.com>.
If I understand what you are trying to do, then here is how I would 
approach it.

Write an HtmlParseFilter that sets an attribute in the ParseData 
MetaData based on whether the page contains what you are looking for. 
Then write another MR job that runs after the crawl/index cycle.  This 
job would need to update the CrawlDatum MetaData based on your priority 
calculation (inlinks and contains text, etc.).  Then hack the Generator 
class around line 160 to change the sort value that it is using based on 
the CrawlDatum MetaData.  I would make using this new sort value an 
option that you can turn on and off by using different configuration values.
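The parse-filter step of this recipe could be sketched roughly as follows. This is a self-contained illustration: the class name, the term-scanning logic, and the `"topical"` metadata key are made up, and the real HtmlParseFilter interface has a different signature.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the HtmlParseFilter step: scan the parse text for
// the target terms and record the result as a metadata flag that a later
// MR job and a modified Generator could read.
public class TopicFlagSketch {

    static Map<String, String> flagTopicality(String parseText,
                                              String[] targetTerms) {
        Map<String, String> metadata = new HashMap<>();
        String lower = parseText.toLowerCase();
        for (String term : targetTerms) {
            if (lower.contains(term.toLowerCase())) {
                metadata.put("topical", "true");
                return metadata;
            }
        }
        metadata.put("topical", "false");
        return metadata;
    }

    public static void main(String[] args) {
        String[] terms = { "nutch" };
        System.out.println(flagTopicality("Intro to Apache Nutch crawling", terms));
        System.out.println(flagTopicality("Unrelated page", terms));
    }
}
```

Note this plain substring check is exactly the simple case the thread moves away from; the point of the later MR job is to replace it with real Lucene scoring.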

Hope this helps.

Dennis Kubes

Brian Whitman wrote:
> On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:
> 
>> You can use an HtmlParseFilter and then set a metadata attribute as to 
>> whether or not it contains the phrase.  Problem with this is that all 
>> of the content is still stored.  You could also change the 
>> ParseOutputFormat to only write out if the word is contained although 
>> that is a bit of a hack.
> 
> I'm not worried about a hack; our whole setup is very "der lauf der 
> dinge" and one more plank won't matter much :) But after sending my 
> question out, I realized that I would need to index the document anyway 
> before being able to lucene query it for topicality. I don't mind having 
> pages stored that don't match my query, but I really would rather the 
> generator not get more outlinks from those pages.
> 
> So a simple fix would be something I can write or run after a 
> crawl/index cycle that can mark certain pages to not emit more URIs in 
> the generator. It would query each page in an index and update some 
> flag. But what is that flag and how can I get at it?
> 
> And more advanced and later on -- the generator has smarts to prioritize 
> fetching by inlink counts-- is there something I can hack to "boost" 
> outlink fetches based on the source page's content? For example, I 
> find a page that scores high on my lucene query after crawl/index gets 
> done. I would want the generator to put all of its outlinks up top, even 
> if there's not many inlinks to that page... would this be a "generator 
> plugin?"
> 
> -Brian
> 

Re: focused crawls -- where to add parse filter

Posted by Brian Whitman <br...@variogr.am>.
On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:

> You can use an HtmlParseFilter and then set a metadata attribute as  
> to whether or not it contains the phrase.  Problem with this is  
> that all of the content is still stored.  You could also change the  
> ParseOutputFormat to only write out if the word is contained  
> although that is a bit of a hack.

I'm not worried about a hack; our whole setup is very "der lauf der 
dinge" and one more plank won't matter much :) But after sending my  
question out, I realized that I would need to index the document  
anyway before being able to lucene query it for topicality. I don't  
mind having pages stored that don't match my query, but I really  
would rather the generator not get more outlinks from those pages.

So a simple fix would be something I can write or run after a 
crawl/index cycle that can mark certain pages to not emit more URIs in the 
generator. It would query each page in an index and update some flag.  
But what is that flag and how can I get at it?

And more advanced and later on -- the generator has smarts to  
prioritize fetching by inlink counts-- is there something I can hack  
to "boost" outlink fetches based on the source page's content? For 
example, I find a page that scores high on my Lucene query after 
crawl/index gets done. I would want the generator to put all of its  
outlinks up top, even if there's not many inlinks to that page...  
would this be a "generator plugin?"

-Brian







Re: focused crawls -- where to add parse filter

Posted by Dennis Kubes <nu...@dragonflymc.com>.
You can use an HtmlParseFilter and then set a metadata attribute as to 
whether or not the page contains the phrase.  The problem with this is that 
all of the content is still stored.  You could also change the 
ParseOutputFormat to only write out pages that contain the word, although 
that is a bit of a hack.

This may be an area where we need to add an extension point if one 
doesn't already exist.  I am sure there are many more people out there 
who would like to selectively store pages based on their content.

Dennis Kubes

Brian Whitman wrote:
> In doing whole-internet focused crawls we'd like a parse/injector filter.
> 
> Say we only want pages in our nutch db and index that have the word 
> "nutch" in them. I'd like to express the rule as a lucene boolean query, 
> contents:nutch, because in our real world scenario the match is more 
> fuzzy and involves many phrases and terms. It's not just a regular 
> expression.
> 
> If the query does not match or matches under a threshold score, I don't 
> want to add the fetched/parsed document to the index, nor (more 
> importantly) have the generator find outlinks from that page for future 
> crawls.
> 
> This is somewhat like a url filter, but instead of filtering by url 
> content I want to filter by parsed page content.
> 
> Where would I add this in nutch?
> 
> -Brian
> 