Posted to dev@nutch.apache.org by Dennis Kubes <ku...@apache.org> on 2010/06/15 01:40:14 UTC

Thoughts on Filtering Fetches, Parses, Content....

There was some discussion before on FetchFilters or ParseFilters and I 
wanted to lay out some ideas.  I have been looking at the flow of 
fetches and parses and here is what I am seeing:

   1. Fetch acquires and stores content.
         1. That content is parsed during the fetch or afterwards through
            ParseSegment.
         2. Parses during fetch and afterwards both eventually use
            ParseUtil and parse filters.
         3. Outputs also eventually use ParseOutputFormat.
   2. Parsing (html) creates outlinks that are stored in ParseData.
         1. The Outlinks are changed to LINKED CrawlDatum objects in
            ParseOutputFormat.
         2. The CrawlDatums are then added back into the CrawlDb using
            the updatedb command (CrawlDb class).
   3. Pages are never really GONE even when they are GONE.
         1. CrawlDatum objects with statuses such as GONE are put back
            into the CrawlDb.
         2. Their next fetch time is determined by FetchSchedule.
   4. Content is available in Fetcher and in ParseSegment, but not in
      ParseOutputFormat.  ParseData/ParseText is available in all.
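To make that flow concrete, here is a rough sketch in Java. The class and method names below (CrawlFlowSketch, outlinksToDatums, nextFetchTime) are illustrative only, not Nutch's actual API; they just model the fetch -> parse -> outlink -> updatedb path described above.

```java
// Illustrative sketch of the fetch/parse/update flow; names mirror Nutch
// concepts (Content, CrawlDatum, FetchSchedule) but are not the real API.
import java.util.ArrayList;
import java.util.List;

public class CrawlFlowSketch {
    public enum Status { FETCHED, LINKED, GONE }

    public record CrawlDatum(String url, Status status, long nextFetchTime) {}

    // 1. Fetch acquires and stores content (stubbed here).
    public static String fetch(String url) {
        return "<html><a href='http://example.com/b'>b</a></html>";
    }

    // 2. Parsing produces outlinks; ParseOutputFormat turns each outlink
    //    into a LINKED CrawlDatum.
    public static List<CrawlDatum> outlinksToDatums(List<String> outlinks) {
        List<CrawlDatum> datums = new ArrayList<>();
        for (String link : outlinks) {
            datums.add(new CrawlDatum(link, Status.LINKED, 0L));
        }
        return datums;
    }

    // 3. updatedb merges datums back into the CrawlDb; even GONE pages get
    //    a next-fetch time from the FetchSchedule, so they are never truly gone.
    public static long nextFetchTime(Status status, long now) {
        long day = 24L * 3600 * 1000;
        return status == Status.GONE ? now + 30 * day : now + day;
    }

    public static void main(String[] args) {
        String content = fetch("http://example.com/a");
        List<CrawlDatum> linked = outlinksToDatums(List.of("http://example.com/b"));
        System.out.println(linked.size());              // 1
        System.out.println(linked.get(0).status());     // LINKED
    }
}
```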

Let's lay out some use cases for FetchFilters.  Let's assume we have a 
topical search engine where we fetch pages and determine if they belong 
to one or more sets (think sports, news, entertainment, etc.).  If a 
page is found to NOT belong to a given set we may want to:

   1. Store its content.  It may be useful for other analysis.
   2. Not store the content.
   3. Parse its outlinks, and fetch the outlinks to see if they do
      belong to the set.
   4. Not parse the outlinks and not fetch any of the outlinks on the
      page.  Break the crawl graph.
   5. Never fetch the page again.
   6. Fetch the page again at some time in the future.

There is also the question of how we make the decision of whether a page 
belongs to a given set.  Do we need the raw HTML, or the extracted text 
so we don't duplicate the parse?  Do we need parse metadata?  I would 
think we would need the content and parse data at the very least.

    * I can see creating a Parse/Fetch filter extension that has options
      for storing content.  I could also see where storing content is
      just the default, where we always do it, even if the page doesn't
      belong to the set.
    * I can see having an option to parse outlinks if the page isn't in
      the set.
    * I don't know how we are going to handle never fetching the page
      again, since those pages get re-added through outlinks.  It may
      require changes to the Generator to always skip status X.
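On that last point, the Generator-side change could be as simple as a status blacklist when building fetch lists. The sketch below is illustrative: OUT_OF_SET is a made-up status (Nutch's real CrawlDatum statuses are byte constants), and generate() stands in for the real Generator's map-side filtering:

```java
// Sketch: a Generator-side check that always skips CrawlDb entries whose
// status a filter has marked as "never fetch again". OUT_OF_SET is a
// hypothetical status, not one of Nutch's real CrawlDatum constants.
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class GeneratorSkipSketch {
    public enum Status { UNFETCHED, FETCHED, GONE, OUT_OF_SET }

    public record Entry(String url, Status status) {}

    // Statuses the generator should never emit into a fetch list.
    private static final Set<Status> NEVER_GENERATE = Set.of(Status.OUT_OF_SET);

    public static List<Entry> generate(List<Entry> crawlDb) {
        return crawlDb.stream()
                      .filter(e -> !NEVER_GENERATE.contains(e.status()))
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> db = List.of(
            new Entry("http://a.example/", Status.UNFETCHED),
            new Entry("http://b.example/", Status.OUT_OF_SET));
        System.out.println(generate(db).size());   // 1
    }
}
```

Note that GONE is deliberately not in the blacklist here: as described above, GONE pages keep a next-fetch time from the FetchSchedule, whereas an out-of-set status would be permanent.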

These are my thoughts.  I would love to know everyone else's thoughts on 
this.

Dennis

Re: Thoughts on Filtering Fetches, Parses, Content....

Posted by Alex McLintock <al...@gmail.com>.
Dennis,

I'm sorry I haven't seen any responses to your thoughts on filters. I
was hoping someone more knowledgeable than me would step in.

I'm quite keen on working on topical crawls as in your first "use
case". However I feel like I am a bit in the dark, as the filter system
seems far more complicated than I'd like. A few people *have* gotten it
working, but I don't see a public design document or "best practice"
yet.


I think we should take the document you've emailed us below and
actually turn it into overall design documentation (which of course I
am volunteering to help with).

Fundamentally you can't tell the worth of a page (or its outlinks)
until it has been parsed. At the moment I have difficulty understanding
the design of the potential filters, so I don't know where the best
place to put my analysis code is, nor where to save the analysis
results.

More later....

Alex

On 15 June 2010 00:40, Dennis Kubes <ku...@apache.org> wrote:
> There was some discussion before on FetchFilters or ParseFilters and I
> wanted to lay out some ideas.  I have been looking at the flow of fetches
> and parses and here is what I am seeing:
> [snip]