Posted to user@nutch.apache.org by HUYLEBROECK Jeremy RD-ILAB-SSF <je...@orange-ft.com> on 2006/07/15 03:16:08 UTC
Parser returning several ParseData?
I am in need of feedback/ideas. ;)
What would be the cleanest way to return several ParseData (or Parse)
objects from a getParse, instead of only one, and still use the rest
of the framework? Has anybody done this?
I have looked at the different classes and where it could be done, but I
always find myself breaking the whole process and having to change the
code in a lot of places.
The use case is like the following:
An RSS document has items; the goal is to index the individual items
rather than indexing the channel as a whole, as parse-rss does.
So the steps would be:
- extract outlinks, keywords ... for one item
- and do it for all the items in the Content.
I think it would then require different ParseImpl, ParseSegment,
Indexer, signature, DeleteDuplicate etc...
Am I completely wrong?
I am trying to use as much Nutch stuff as possible as I use it for HTML
stuff also. Otherwise, I'll go for mostly hadoop and some sort of
light-nutch with a homemade scheduler/adaptive fetch/crawldb/parser.
Your thoughts are much appreciated to help my brain at the end of a
Friday afternoon... ;)
Thanks!
Jeremy.
Re: Parser returning several ParseData?
Posted by Andrzej Bialecki <ab...@getopt.org>.
HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> I am in need of feedback/ideas. ;)
>
> What would be the cleanest way to return several ParseData (or Parse)
> objects from a getParse, instead of only one, and still use the rest
> of the framework? Has anybody done this?
> I have looked at the different classes and where it could be done, but I
> always find myself breaking the whole process and having to change the
> code in a lot of places.
>
Well, the problem with this is that the current Nutch architecture follows
several assumptions that make this difficult:
1) it enforces a strong split between the protocol and parse layers: once
the resource content leaves the protocol layer there is no way back to
fetch additional resources (but see below),
2) it assumes that one input URL results in a single resource,
3) it assumes that URLs identify independent resources (there is no
composition or aggregation of resources),
4) fetching is performed breadth-first, in random order.
Of course, that's a bunch of idealistic assumptions ... ;) In reality,
Nutch compromises some of them:
ad 1) some of the parse-level data gets pushed down to the protocol
layer if needed, namely redirects and robots exclusion metadata from
HTML meta tags (the same should be done for set-cookie, but this is not
handled yet). This is further complicated by the fact that fetching and
parsing don't have to be tightly coupled in a single process; they may
be executed as separate batch jobs, so there are private mini-protocols
between these layers to facilitate passing this info across batch runs.
ad 2) only redirects are handled now, in the sense that all data (both
the response before redirect and after redirect) are stored. There is no
support for returning multiple responses from a single request. RSS is a
good example of why we would need to extend the API to provide this
support. Exhaustive fetching scenarios (e.g. collect all URLs below a
given URL path) would be another case. Crawling a DB (select * from $TABLE)
would be yet another case where this support would make sense.
ad 3) Nutch doesn't handle this at all now. This is sometimes
frustrating, because if you get one part of a page (e.g. the top frame),
you can't be sure that you got all subcomponents (images, nested frames,
scripts) that match this particular version of the container-type
resource. This may affect the subsequent analysis of the page, and
eventually it will affect the "cached view". Support for this
functionality would be a welcome addition. I intended to pursue this
subject when I added ParseStatus.FAILED_MISSING_PARTS - please see the
javadoc there - however, no code at the moment makes use of this.
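
As a rough illustration of the part-whole check described in ad 3), the
sketch below compares the sub-resources a container page refers to with
the ones that were actually fetched. It does not use any Nutch class:
PartCheck, Status and missingParts are hypothetical names, and the local
enum only borrows the FAILED_MISSING_PARTS name mentioned above.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Nutch.
final class PartCheck {
    // Local stand-in enum; it only reuses the status name from the text.
    enum Status { SUCCESS, FAILED_MISSING_PARTS }

    // Returns the required sub-resource URLs that are absent from 'fetched',
    // preserving the order in which they were required.
    static Set<String> missingParts(Collection<String> required, Set<String> fetched) {
        Set<String> missing = new LinkedHashSet<>(required);
        missing.removeAll(fetched);
        return missing;
    }

    // A container is complete only if none of its parts are missing.
    static Status status(Collection<String> required, Set<String> fetched) {
        return missingParts(required, fetched).isEmpty()
                ? Status.SUCCESS
                : Status.FAILED_MISSING_PARTS;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList(
                "http://example.com/top.html", "http://example.com/logo.png");
        Set<String> fetched = new HashSet<>(
                Arrays.asList("http://example.com/top.html"));
        System.out.println(status(required, fetched)
                + " " + missingParts(required, fetched));
    }
}
```

A real implementation would also have to decide which version of each
part matches the container, which this sketch ignores.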
ad 4) fetch jobs are organized along randomized fetchlists, and not
high-level instructions like "fetch depth-first starting from this url,
n-levels deep, at most M pages". This could be fixed by changing the
Generator and Fetcher (or rather implementing alternative versions of
each).
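
To make the "depth-first, n-levels deep, at most M pages" instruction from
ad 4) concrete, here is a minimal sketch of such a traversal over an
in-memory link graph. It is not how the Nutch Generator or Fetcher work;
DepthFirstPlanner, plan and the Map-based graph are assumptions made only
for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical planner for illustration; not a Nutch class.
final class DepthFirstPlanner {
    // Returns URLs in depth-first visiting order, starting at 'start',
    // going at most 'maxDepth' links deep and returning at most 'maxPages'.
    static List<String> plan(Map<String, List<String>> links, String start,
                             int maxDepth, int maxPages) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> urls = new ArrayDeque<>();     // explicit DFS stack
        Deque<Integer> depths = new ArrayDeque<>();  // depth of each stacked URL
        urls.push(start);
        depths.push(0);
        while (!urls.isEmpty() && order.size() < maxPages) {
            String url = urls.pop();
            int depth = depths.pop();
            if (!seen.add(url)) {
                continue; // already visited through another path
            }
            order.add(url);
            if (depth == maxDepth) {
                continue; // do not follow outlinks past the depth limit
            }
            List<String> out = links.getOrDefault(url, Collections.emptyList());
            // Push in reverse so the first outlink is visited first.
            for (int i = out.size() - 1; i >= 0; i--) {
                urls.push(out.get(i));
                depths.push(depth + 1);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("d"));
        System.out.println(plan(links, "a", 2, 10));
    }
}
```

In a batch setting the link graph is not known up front, so a real
version would have to interleave planning with fetching and parsing.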
> The use case is like the following:
> An RSS document has items; the goal is to index the individual items
> rather than indexing the channel as a whole, as parse-rss does.
> So the steps would be:
> - extract outlinks, keywords ... for one item
> - and do it for all the items in the Content.
> I think it would then require different ParseImpl, ParseSegment,
> Indexer, signature, DeleteDuplicate etc...
>
I don't think all of them would have to be modified. So long as you
don't change the segment format, most tools should work properly. A lot
of meta-information (like aggregation relationships) can be carried
across in CrawlDatum.metaData or ParseData.metadata.
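
A minimal sketch of the idea, assuming plain JDK XML parsing rather than
the Nutch parse API: each RSS item becomes its own record, and the
aggregation relationship to the feed travels in a per-record metadata map.
ItemRecord, RssItemSplitter and the "parent.feed.url" / "item.index" keys
are hypothetical names chosen for this example; in Nutch the map would
correspond to something like the metadata fields mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical stand-in for a per-item parse result; not a Nutch class.
final class ItemRecord {
    final String url;
    final String title;
    final Map<String, String> metadata = new LinkedHashMap<>();

    ItemRecord(String url, String title) {
        this.url = url;
        this.title = title;
    }
}

final class RssItemSplitter {
    static final String PARENT_KEY = "parent.feed.url"; // hypothetical key

    // Produces one record per <item>, each tagged with the feed it came from.
    static List<ItemRecord> split(String feedUrl, String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, since feeds are untrusted input.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<ItemRecord> records = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                ItemRecord record = new ItemRecord(text(item, "link"), text(item, "title"));
                record.metadata.put(PARENT_KEY, feedUrl);
                record.metadata.put("item.index", Integer.toString(i));
                records.add(record);
            }
            return records;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse feed " + feedUrl, e);
        }
    }

    // Text of the first descendant element named 'tag', or "" if absent.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<rss><channel><title>Feed</title>"
                + "<item><title>A</title><link>http://example.com/a</link></item>"
                + "</channel></rss>";
        for (ItemRecord r : split("http://example.com/feed.xml", xml)) {
            System.out.println(r.url + " " + r.metadata);
        }
    }
}
```

Keeping the parent URL in metadata leaves the record layout unchanged,
which is the property that lets existing tools keep working.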
> Am I completely wrong?
> I am trying to use as much Nutch stuff as possible as I use it for HTML
> stuff also. Otherwise, I'll go for mostly hadoop and some sort of
> light-nutch with a homemade scheduler/adaptive fetch/crawldb/parser.
>
> Your thoughts are much appreciated to help my brain at the end of a
> Friday afternoon... ;)
>
Well, it's hard to say whether it's better to work around or change the
Fetcher and associated tools, or just pick some Nutch parts (crawldb,
segments, parsers, protocol handlers) and write your own fetcher/generator,
using Hadoop as the overall framework.
Unfortunately, some seemingly simple changes (e.g. extending the
Protocol interface to return Iterator<ProtocolOutput>, and Parser to
return Iterator<Parse>) have far-reaching consequences across many parts
of Nutch, not only from the purely mechanical view of API compatibility,
but also from the semantic POV (discovering new resources, updating old
ones, managing part-whole relationships, etc).
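
To show just the mechanical side of such a change, here is a small sketch
with stand-in types instead of the real Nutch interfaces. SimpleParse,
SingleParser, MultiParser and MultiParserDemo are hypothetical names:
SingleParser models a one-result contract, MultiParser models the
Iterator-returning variant, and adapt() wraps the former as the latter so
existing parsers would keep working.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a parse result; not the Nutch Parse interface.
final class SimpleParse {
    final String text;

    SimpleParse(String text) {
        this.text = text;
    }
}

// Models a contract with exactly one parse per content.
interface SingleParser {
    SimpleParse getParse(String content);
}

// Models the extended contract with any number of parses per content.
interface MultiParser {
    Iterator<SimpleParse> getParses(String content);
}

final class MultiParserDemo {
    // Wraps a single-result parser so callers only see the multi-result API.
    static MultiParser adapt(SingleParser parser) {
        return content -> Collections.singletonList(parser.getParse(content)).iterator();
    }

    // Toy multi-result parser: one parse per non-blank line of the content.
    static final MultiParser RECORD_PER_LINE = content -> {
        List<SimpleParse> parses = new ArrayList<>();
        for (String line : content.split("\n")) {
            if (!line.trim().isEmpty()) {
                parses.add(new SimpleParse(line.trim()));
            }
        }
        return parses.iterator();
    };

    // Callers now loop where they used to handle a single object.
    static int countParses(MultiParser parser, String content) {
        int count = 0;
        Iterator<SimpleParse> it = parser.getParses(content);
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        MultiParser whole = adapt(SimpleParse::new);
        System.out.println(countParses(whole, "item one\nitem two"));
        System.out.println(countParses(RECORD_PER_LINE, "item one\nitem two"));
    }
}
```

The adapter handles only API compatibility. The semantic questions
(discovering, updating and relating the extra results) are untouched by
this sketch.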
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com