Posted to user@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2007/02/17 18:48:41 UTC
focused crawls -- where to add parse filter
In doing whole-internet focused crawls we'd like a parse/injector
filter.
Say we only want pages in our Nutch db and index that have the word
"nutch" in them. I'd like to express the rule as a Lucene boolean
query, contents:nutch, because in our real-world scenario the match
is fuzzier and involves many phrases and terms. It's not just a
regular expression.
If the query does not match or matches under a threshold score, I
don't want to add the fetched/parsed document to the index, nor (more
importantly) have the generator find outlinks from that page for
future crawls.
This is somewhat like a url filter, but instead of filtering by url
content I want to filter by parsed page content.
Where would I add this in nutch?
-Brian
Re: focused crawls -- where to add parse filter
Posted by Doğacan Güney <do...@gmail.com>.
On 2/19/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
[snip]
>
> You could drop the HtmlParseFilter part and simply write a post-
> crawl/index MR job that updates the CrawlDatum based on your Lucene
> queries. You would still need to write the second part that does the
> generation based on a different sort value.
The second part can be written with a different scoring plugin. Simply
put whatever you need in the CrawlDatum's metadata, then change
ScoringFilter.generatorSortValue to look up that value and assign a
good or bad score.
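As a plain-Java sketch of that idea (the Map stands in for the
CrawlDatum's metadata; the key name "focus.score", the threshold, and
the boost factor are all made up, and the real
ScoringFilter.generatorSortValue has a different signature):

```java
import java.util.HashMap;
import java.util.Map;

public class FocusedSortSketch {

    // Stand-in for ScoringFilter.generatorSortValue: look up the score
    // that a post-index job stored in the datum's metadata and turn it
    // into a generate-list sort value.
    public static float generatorSortValue(float initialSort,
                                           Map<String, String> datumMeta) {
        String stored = datumMeta.get("focus.score"); // hypothetical key
        if (stored == null) {
            return initialSort; // page never scored: keep the normal sort
        }
        float focusScore = Float.parseFloat(stored);
        if (focusScore < 0.5f) {
            return -1f; // below threshold: push to the back of the list
        }
        return initialSort * focusScore * 10f; // on-topic: sort it up front
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("focus.score", "0.9");
        System.out.println(generatorSortValue(1.0f, meta));
        meta.put("focus.score", "0.1");
        System.out.println(generatorSortValue(1.0f, meta));
    }
}
```

The point is only that the sort value can come from metadata rather
than from the stored page score.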
[snip]
--
Doğacan Güney
Re: focused crawls -- where to add parse filter
Posted by Dennis Kubes <nu...@dragonflymc.com>.
Brian Whitman wrote:
>
>> How about an outlink filter that works during parse? In
>> ParseOutputFormat,
>> it will take the parse text, parse data (etc.) of the source page and
>> the destination url then will either return "filter this outlink" or
>> "let it through".
>
>> Write an HtmlParseFilter that sets an attribute in the ParseData
>> MetaData based on whether the page contains what you are looking for.
>> Then write another MR job that runs after the crawl/index cycle. This
>> job would need to update the CrawlDatum MetaData based on your
>> priority calculation (inlinks and contains text, etc.). Then hack the
>> Generator class around line 160 to change the sort value that it is
>> using based on the CrawlDatum MetaData. I would make using this new
>> sort value an option that you can turn on and off by using different
>> configuration values.
>
> Hi Doğacan, Dennis:
>
> Thanks for the ideas. I spent some time mentally planning out how to
> implement both of these ideas by looking at the source. I'm still newish
> to Nutch so excuse my naiveté.
>
> Do either of these approaches let me get at the analyzed/indexed
> contents of the page text so that I can perform Lucene queries for
> filtering? From what I could tell of the HtmlParseFilter, or Parse in
> general, it gets me the parse tree, which I could run regexp queries
> on -- but I'd rather it all be in Lucene and be influenced by the
> relative ranking of terms among all documents. I am envisioning
> machine-generated queries from our classifiers that might be hundreds
> of tokens long, with boost values per term and a score threshold. So
> I'd need to act on the documents post-index. Unless I'm reading your
> suggestions incorrectly, neither of them lets me at that?
You could drop the HtmlParseFilter part and simply write a post-
crawl/index MR job that updates the CrawlDatum based on your Lucene
queries. You would still need to write the second part that does the
generation based on a different sort value.
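A toy sketch of the update that such a post-crawl job would record,
per URL (plain Java, no Hadoop here; the metadata keys and the
threshold are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class CrawlDatumUpdateSketch {

    // Stand-in for the reduce step of the post-index MR job: write the
    // page's Lucene query score into its CrawlDatum metadata, plus a
    // flag saying whether its outlinks should be expanded later.
    public static Map<String, String> recordScore(Map<String, String> datumMeta,
                                                  float luceneScore,
                                                  float threshold) {
        datumMeta.put("focus.score", Float.toString(luceneScore));
        datumMeta.put("focus.expand",
                Boolean.toString(luceneScore >= threshold));
        return datumMeta;
    }

    public static void main(String[] args) {
        Map<String, String> meta = recordScore(new HashMap<>(), 0.8f, 0.5f);
        System.out.println(meta.get("focus.expand")); // prints "true"
    }
}
```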
>
> I am currently looking at PruneIndexTool -- could a modification of this
> work? I could run it after a crawl/index cycle but before invertlinks
> and the next generate. The one issue I see is that PruneIndexTool claims
> not to affect the WebDB. Does this mean that even though the lucene doc
> will be gone, the link and outlinks will remain in the WebDB and will be
> fetched anyway?
That is correct. You will need to alter the CrawlDb to affect what is
generated and hence fetched.
>
> If I should instead be looking harder at your recommended
> HtmlParseFilter or ParseOutputFormat, please correct me.
No. If you are doing complex queries rather than something like "does
this page contain words x, y, and z," then I wouldn't do it through an
HtmlParseFilter; I would go with the post-index Lucene approach.
Dennis Kubes
>
> -Brian
>
Re: focused crawls -- where to add parse filter
Posted by Brian Whitman <br...@variogr.am>.
> How about an outlink filter that works during parse? In
> ParseOutputFormat,
> it will take the parse text, parse data (etc.) of the source page and
> the destination url then will either return "filter this outlink" or
> "let it through".
> Write an HtmlParseFilter that sets an attribute in the ParseData
> MetaData based on whether the page contains what you are looking
> for. Then write another MR job that runs after the crawl/index
> cycle. This job would need to update the CrawlDatum MetaData based
> on your priority calculation (inlinks and contains text, etc.).
> Then hack the Generator class around line 160 to change the sort
> value that it is using based on the CrawlDatum MetaData. I would
> make using this new sort value an option that you can turn on and
> off by using different configuration values.
Hi Doğacan, Dennis:
Thanks for the ideas. I spent some time mentally planning out how to
implement both of these ideas by looking at the source. I'm still
newish to Nutch so excuse my naiveté.
Do either of these approaches let me get at the analyzed/indexed
contents of the page text so that I can perform Lucene queries for
filtering? From what I could tell of the HtmlParseFilter, or Parse in
general, it gets me the parse tree, which I could run regexp queries
on -- but I'd rather it all be in Lucene and be influenced by the
relative ranking of terms among all documents. I am envisioning
machine-generated queries from our classifiers that might be hundreds
of tokens long, with boost values per term and a score threshold. So
I'd need to act on the documents post-index. Unless I'm reading your
suggestions incorrectly, neither of them lets me at that?
I am currently looking at PruneIndexTool -- could a modification of
this work? I could run it after a crawl/index cycle but before
invertlinks and the next generate. The one issue I see is that
PruneIndexTool claims not to affect the WebDB. Does this mean that
even though the lucene doc will be gone, the link and outlinks will
remain in the WebDB and will be fetched anyway?
If I should instead be looking harder at your recommended
HtmlParseFilter or ParseOutputFormat, please correct me.
-Brian
Re: focused crawls -- where to add parse filter
Posted by Doğacan Güney <do...@gmail.com>.
Hi,
On 2/17/07, Brian Whitman <br...@variogr.am> wrote:
>
> I'm not worried about a hack; our whole setup is very "Der Lauf der
> Dinge" and one more plank won't matter much :) But after sending my
> question out, I realized that I would need to index the document
> anyway before being able to query it with Lucene for topicality. I
> don't mind having pages stored that don't match my query, but I
> really would rather the generator not get more outlinks from those
> pages.
How about an outlink filter that works during parse? In
ParseOutputFormat, it would take the parse text, parse data, etc. of
the source page and the destination URL, then either return "filter
this outlink" or "let it through".
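A stand-alone sketch of the decision such a filter would make (the
method name is invented, and a trivial keyword check stands in for the
real relevance scoring; the actual hook in ParseOutputFormat would
receive Nutch's parse objects, not bare strings):

```java
public class OutlinkFilterSketch {

    // Decide whether an outlink discovered on a source page should be
    // written out for future fetching.
    public static boolean keepOutlink(String sourcePageText, String destUrl) {
        if (destUrl == null || destUrl.isEmpty()) {
            return false; // nothing to follow
        }
        // Only let outlinks through from pages that look on-topic.
        return sourcePageText != null
                && sourcePageText.toLowerCase().contains("nutch");
    }
}
```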
>
> So a simple fix would be something I can write or run after a crawl/
> index cycle that can mark certain pages to not emit more URIs in the
> generator. It would query each page in an index and update some flag.
> But what is that flag and how can I get at it?
>
> And more advanced and later on -- the generator has smarts to
> prioritize fetching by inlink counts-- is there something I can hack
> to "boost" outlink fetches based on the source page's content? for
> example - I find a page that scores high on my lucene query after
> crawl/index gets done. I would want the generator to put all of its
> outlinks up top, even if there's not many inlinks to that page...
> would this be a "generator plugin?"
You should be able to do this with a scoring plugin and a parse
plugin. Write a parse plugin (or update a current one) to analyze the
content and put the result in the parse data's metadata (for example,
put a <"boost", "10"> pair in it). Then, in
<your_scoring_filter>.distributeScoreToOutlink, check whether the
parse data's metadata has the "boost" field and boost accordingly. You
may also want to consider changing the indexerScore method to give it
an even higher boost.
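In plain Java, the distributeScoreToOutlink side of that could look
roughly like this (the Map stands in for the parse data's metadata;
the "boost" key is just the example above, and the real method
signature is different):

```java
import java.util.Map;

public class OutlinkBoostSketch {

    // Stand-in for a scoring filter's distributeScoreToOutlink: if the
    // parse plugin stored a "boost" entry (e.g. <"boost", "10">) in the
    // parse data's metadata, scale the score handed to each outlink.
    public static float outlinkScore(float baseScore, Map<String, String> parseMeta) {
        String boost = parseMeta.get("boost");
        if (boost == null) {
            return baseScore; // no boost recorded for this source page
        }
        return baseScore * Float.parseFloat(boost);
    }
}
```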
>
> -Brian
>
--
Doğacan Güney
Re: focused crawls -- where to add parse filter
Posted by Dennis Kubes <nu...@dragonflymc.com>.
If I understand what you are trying to do then here is how I would
approach it.
Write an HtmlParseFilter that sets an attribute in the ParseData
MetaData based on whether the page contains what you are looking for.
Then write another MR job that runs after the crawl/index cycle; this
job would update the CrawlDatum MetaData based on your priority
calculation (inlinks, contains text, etc.). Then hack the Generator
class around line 160 to change the sort value it uses based on the
CrawlDatum MetaData. I would make this new sort value an option that
can be turned on and off through configuration.
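The Generator-side switch could be shaped roughly like this (a
plain-Java sketch; the option flag and the metadata key "focus.priority"
are made up, and the real code around Generator line 160 works on
CrawlDatum objects rather than a Map):

```java
import java.util.Map;

public class GeneratorSortSwitchSketch {

    // Stand-in for the hacked sort-value computation in Generator: when
    // the focused-sort option is on, sort by the priority a post-index
    // job stored in CrawlDatum metadata; otherwise keep the normal value.
    public static float sortValue(float normalSort,
                                  Map<String, String> datumMeta,
                                  boolean focusedSortEnabled) {
        if (!focusedSortEnabled) {
            return normalSort;
        }
        String priority = datumMeta.get("focus.priority"); // hypothetical key
        return priority == null ? normalSort : Float.parseFloat(priority);
    }
}
```

Keeping the option off by default preserves the stock generate
behavior for anyone not doing focused crawls.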
Hope this helps.
Dennis Kubes
Brian Whitman wrote:
> On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:
>
>> You can use an HtmlParseFilter and then set a metadata attribute as to
>> whether or not it contains the phrase. Problem with this is that all
>> of the content is still stored. You could also change the
>> ParseOutputFormat to only write out if the word is contained although
>> that is a bit of a hack.
>
> I'm not worried about a hack; our whole setup is very "Der Lauf der
> Dinge" and one more plank won't matter much :) But after sending my
> question out, I realized that I would need to index the document
> anyway before being able to query it with Lucene for topicality. I
> don't mind having pages stored that don't match my query, but I
> really would rather the generator not get more outlinks from those
> pages.
>
> So a simple fix would be something I can write or run after a
> crawl/index cycle that can mark certain pages to not emit more URIs in
> the generator. It would query each page in an index and update some
> flag. But what is that flag and how can I get at it?
>
> And, more advanced and later on: the generator has smarts to
> prioritize fetching by inlink counts. Is there something I can hack
> to "boost" outlink fetches based on the source page's content? For
> example, I find a page that scores high on my Lucene query after
> crawl/index gets done; I would want the generator to put all of its
> outlinks up top, even if there aren't many inlinks to that page.
> Would this be a "generator plugin"?
>
> -Brian
Re: focused crawls -- where to add parse filter
Posted by Brian Whitman <br...@variogr.am>.
On Feb 17, 2007, at 12:58 PM, Dennis Kubes wrote:
> You can use an HtmlParseFilter and then set a metadata attribute as
> to whether or not it contains the phrase. Problem with this is
> that all of the content is still stored. You could also change the
> ParseOutputFormat to only write out if the word is contained
> although that is a bit of a hack.
I'm not worried about a hack; our whole setup is very "Der Lauf der
Dinge" and one more plank won't matter much :) But after sending my
question out, I realized that I would need to index the document
anyway before being able to query it with Lucene for topicality. I
don't mind having pages stored that don't match my query, but I really
would rather the generator not get more outlinks from those pages.
So a simple fix would be something I can write or run after a crawl/
index cycle that can mark certain pages to not emit more URIs in the
generator. It would query each page in an index and update some flag.
But what is that flag and how can I get at it?
And, more advanced and later on: the generator has smarts to
prioritize fetching by inlink counts. Is there something I can hack
to "boost" outlink fetches based on the source page's content? For
example, I find a page that scores high on my Lucene query after
crawl/index gets done; I would want the generator to put all of its
outlinks up top, even if there aren't many inlinks to that page.
Would this be a "generator plugin"?
-Brian
Re: focused crawls -- where to add parse filter
Posted by Dennis Kubes <nu...@dragonflymc.com>.
You can use an HtmlParseFilter and then set a metadata attribute as to
whether or not the page contains the phrase. The problem with this is
that all of the content is still stored. You could also change the
ParseOutputFormat to only write out the page if the word is contained,
although that is a bit of a hack.
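A minimal stand-alone sketch of that filter's core check (the metadata
key "contains.phrase" is invented, and a real HtmlParseFilter would
set the attribute on the ParseData's metadata rather than return a
fresh map):

```java
import java.util.HashMap;
import java.util.Map;

public class ParseFlagSketch {

    // Inspect the parsed page text and record whether the target phrase
    // appears, as a metadata attribute later steps can read.
    public static Map<String, String> flagPhrase(String parsedText, String phrase) {
        boolean found = parsedText != null
                && parsedText.toLowerCase().contains(phrase.toLowerCase());
        Map<String, String> meta = new HashMap<>();
        meta.put("contains.phrase", Boolean.toString(found));
        return meta;
    }
}
```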
This may be an area where we need to add an extension point if one
doesn't already exist. I am sure there are many more people out there
who would like to selectively store pages based on their content.
Dennis Kubes
Brian Whitman wrote:
> In doing whole-internet focused crawls we'd like a parse/injector filter.
>
> Say we only want pages in our Nutch db and index that have the word
> "nutch" in them. I'd like to express the rule as a Lucene boolean
> query, contents:nutch, because in our real-world scenario the match
> is fuzzier and involves many phrases and terms. It's not just a
> regular expression.
>
> If the query does not match or matches under a threshold score, I don't
> want to add the fetched/parsed document to the index, nor (more
> importantly) have the generator find outlinks from that page for future
> crawls.
>
> This is somewhat like a url filter, but instead of filtering by url
> content I want to filter by parsed page content.
>
> Where would I add this in nutch?
>
> -Brian