Posted to user@nutch.apache.org by HUYLEBROECK Jeremy RD-ILAB-SSF <je...@orange-ft.com> on 2006/07/15 03:16:08 UTC

Parser returning several ParseData?

I am in need of feedback/ideas. ;)

What would be the cleanest way to return several ParseData (or Parse)
objects from a getParse, instead of only one, and still use the rest
of the framework? Has anybody done this?
I have looked at the different classes to see where it could be done,
but I always find myself breaking the whole process and having to
change the code in a lot of places.

The use case is the following:
An RSS document has items; the goal is to index the individual items
rather than the channel, which is what parse-rss indexes.
So the steps would be:
- extract outlinks, keywords, ... for one item
and do the same for all the items in the Content.
I think it would then require a different ParseImpl, ParseSegment,
Indexer, signature, DeleteDuplicate, etc.

Am I completely wrong?
I am trying to use as much Nutch stuff as possible as I use it for HTML
stuff also. Otherwise, I'll go for mostly hadoop and some sort of
light-nutch with a homemade scheduler/adaptive fetch/crawldb/parser.

Your thoughts are much appreciated to help my brain on a Friday end of
afternoon... ;)
Thanks!

Jeremy.

Re: Parser returning several ParseData?

Posted by Andrzej Bialecki <ab...@getopt.org>.

HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> I am in need of feedback/ideas. ;)
>
> What would be the cleanest way to not return only one ParseData (or
> Parse) object from a getParse but return several and still use the rest
> of the framework? Anybody did this?
> I look at the different classes and where it could be done but I always
> find me breaking the whole process and having to change the code in a
> lot of places.
>   

Well, the problem with this is that the current Nutch architecture
follows several assumptions that make this difficult:

1) it enforces a strong split between the protocol and parse layers:
once the resource content leaves the protocol layer there is no way
back to fetch additional resources (but see below),

2) it assumes that one input URL results in a single resource,

3) it assumes that URLs identify independent resources (there is no
composition or aggregation of resources),

4) fetching is performed breadth-first, in random order.

Of course, that's a bunch of idealistic assumptions ... ;) In reality, 
Nutch compromises some of them:

ad 1) some of the parse-level data gets pushed down to the protocol
layer if needed, namely redirects and robots exclusion metadata from
HTML meta tags (the same should be done for set-cookie, but this is not
handled yet). This is further complicated by the fact that fetching and
parsing don't have to be tightly coupled in a single process; they may
be executed as separate batch jobs, so there are private mini-protocols
between these layers to pass this info across batch runs.

ad 2) only redirects are handled now, in the sense that all data (both
the response before the redirect and the one after it) are stored.
There is no support for returning multiple responses from a single
request. RSS is a good example of why we would need to extend the API
to provide this support. Exhaustive fetching scenarios (e.g. collect
all URLs below a given URL path) would be another case. Crawling a DB
(select * from $TABLE) would be yet another case where this support
would make sense.

ad 3) Nutch doesn't handle this at all now. This is sometimes 
frustrating, because if you get one part of a page (e.g. the top frame), 
you can't be sure that you got all subcomponents (images, nested frames, 
scripts) that match this particular version of the container-type 
resource. This may affect the subsequent analysis of the page, and 
eventually it will affect the "cached view". Support for this 
functionality would be a welcome addition. I intended to pursue this 
subject when I added ParseStatus.FAILED_MISSING_PARTS - please see the 
javadoc there - however, no code at the moment makes use of this.

ad 4) fetch jobs are organized along randomized fetchlists, and not
along high-level instructions like "fetch depth-first starting from
this URL, N levels deep, at most M pages". This could be fixed by
changing the Generator and Fetcher (or rather by implementing
alternative versions of each).
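
To make the "depth-first, N levels deep, at most M pages" instruction
concrete, here is a rough sketch of the visit order such a fetcher
might follow. It is not Nutch code: DepthFirstPlanner, plan() and the
in-memory link map are all made up for illustration, and a real
Generator/Fetcher pair would work against the crawldb and segments
instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical planner for illustration only; not part of Nutch. */
final class DepthFirstPlanner {

    /** Returns URLs in depth-first visit order, bounded by maxDepth and maxPages. */
    static List<String> plan(Map<String, List<String>> links, String start,
                             int maxDepth, int maxPages) {
        List<String> order = new ArrayList<>();
        visit(links, start, 0, maxDepth, maxPages, new HashSet<>(), order);
        return order;
    }

    private static void visit(Map<String, List<String>> links, String url, int depth,
                              int maxDepth, int maxPages, Set<String> seen,
                              List<String> order) {
        // Stop when too deep, when the page budget is spent, or on a repeat visit.
        if (depth > maxDepth || order.size() >= maxPages || !seen.add(url)) {
            return;
        }
        order.add(url);
        for (String next : links.getOrDefault(url, Collections.emptyList())) {
            visit(links, next, depth + 1, maxDepth, maxPages, seen, order);
        }
    }

    public static void main(String[] args) {
        // Toy link graph: a -> b, c; b -> d; d -> e.
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        links.put("d", Arrays.asList("e"));
        System.out.println(plan(links, "a", 2, 10)); // prints [a, b, d, c]
    }
}
```

Note that this simplification marks a page as seen on first contact, so
a page first reached at the depth limit is not expanded again even if a
shorter path to it exists.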

> The use case is like the following:
> An RSS document has items, the goal is to index the Items and not the
> channel like parse-rss does.
> So the steps would be
> -extract outlinks, keywords ...  for one item
> And do it for all the items in the Content.
> I think it would then require different ParseImpl, ParseSegment,
> Indexer, signature, DeleteDuplicate etc...
>   

I don't think all of them would have to be modified. So long as you
don't change the segment format, most tools should work properly. A lot
of meta-information (like aggregation relationships) can be carried
across in CrawlDatum.metaData or ParseData.metadata.
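
As an illustration of carrying the aggregation relationship in
metadata, the sketch below splits a feed into per-item records and tags
each one with its parent feed URL and its position. Everything here is
an assumption made for the example: ItemParse, FeedSplitter and the two
metadata keys are invented stand-ins, not the Nutch ParseData or
CrawlDatum classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one parsed item; not a Nutch class. */
final class ItemParse {
    final String url;
    final String text;
    final Map<String, String> metadata = new LinkedHashMap<>();

    ItemParse(String url, String text) {
        this.url = url;
        this.text = text;
    }
}

/** Hypothetical helper that turns one feed into several item records. */
final class FeedSplitter {
    // Metadata keys invented for this sketch only.
    static final String PARENT_URL = "x-parent-url";
    static final String ITEM_INDEX = "x-item-index";

    /** Each entry of items is a pair {itemUrl, itemText}. */
    static List<ItemParse> split(String feedUrl, List<String[]> items) {
        List<ItemParse> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            String[] item = items.get(i);
            ItemParse parse = new ItemParse(item[0], item[1]);
            // Record the part-whole relationship so later tools can see it.
            parse.metadata.put(PARENT_URL, feedUrl);
            parse.metadata.put(ITEM_INDEX, Integer.toString(i));
            out.add(parse);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> items = new ArrayList<>();
        items.add(new String[] {"http://example.com/a", "first item"});
        items.add(new String[] {"http://example.com/b", "second item"});
        for (ItemParse p : split("http://example.com/feed.rss", items)) {
            System.out.println(p.url + " " + p.metadata);
        }
    }
}
```

Downstream steps (indexing, deduplication) could then key on the item
URL while still being able to find the channel the item came from.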

> Am I completely wrong?
> I am trying to use as much Nutch stuff as possible as I use it for HTML
> stuff also. Otherwise, I'll go for mostly hadoop and some sort of
> light-nutch with a homemade scheduler/adaptive fetch/crawldb/parser.
>
> Your thoughts are much appreciated to help my brain on a Friday end of
> afternoon... ;)
>   

Well, it's hard to say whether it's better to work around or change the
Fetcher and associated tools, or to just pick some Nutch parts
(crawldb, segments, parsers, protocol handlers) and write your own
fetcher/generator, using hadoop as the overall framework.

Unfortunately, some seemingly simple changes (e.g. extending the
Protocol interface to return Iterator<ProtocolOutput>, and Parser to
return Iterator<Parse>) have far-reaching consequences across many
parts of Nutch, not only from the purely mechanical view of API
compatibility, but also from the semantic POV (discovering new
resources, updating old ones, managing part-whole relationships, etc).
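
To show only the mechanical half of that change, here is a minimal
sketch of an iterator-returning parser contract next to a single-result
one, with an adapter so old implementations keep working. The
interfaces are deliberately simplified inventions (SingleParser,
MultiParser and ParserAdapters do not exist in Nutch, and plain strings
stand in for Content and Parse). The semantic questions listed above
are not addressed by it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical single-result contract: one input, one parse. */
interface SingleParser {
    String parse(String content);
}

/** Hypothetical extended contract: one input may yield several parses. */
interface MultiParser {
    Iterator<String> parseAll(String content);
}

final class ParserAdapters {

    /** Wraps a single-result parser so callers can use the iterator form. */
    static MultiParser adapt(SingleParser legacy) {
        return content -> Collections.singletonList(legacy.parse(content)).iterator();
    }

    /** Toy multi-result parser: treats each non-blank line as one item. */
    static MultiParser lineSplitter() {
        return content -> {
            List<String> items = new ArrayList<>();
            for (String line : content.split("\n")) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    items.add(trimmed);
                }
            }
            return items.iterator();
        };
    }

    /** Collects an iterator into a list, which is handy for callers and tests. */
    static List<String> drain(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(lineSplitter().parseAll("item one\n\nitem two\n")));
        System.out.println(drain(adapt(String::trim).parseAll("  whole document  ")));
    }
}
```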

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com