Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/12/30 12:16:43 UTC

[jira] Commented: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS

    [ https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554989 ] 

Doğacan Güney commented on NUTCH-596:
-------------------------------------

I agree that ParseSegment's end result should be identical to parsing during fetching. 

>read the status code in the Metadata of the Content object

This is possible but feels like a hack. Still, if we can't come up with anything better, we can use it.
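
To make the first option concrete, here is a minimal sketch, not actual Nutch code. It assumes the fetcher would write the fetch status into the Content metadata under some key. The key name FETCH_STATUS_KEY, the numeric value of STATUS_FETCH_SUCCESS, and the use of a plain Map in place of the Metadata class are all placeholders for illustration. Parsing content that carries no status at all (for example, older segments) is also an assumption.

```java
import java.util.Collections;
import java.util.Map;

class MetadataStatusCheck {
    // Placeholder standing in for CrawlDatum.STATUS_FETCH_SUCCESS (value is illustrative).
    static final int STATUS_FETCH_SUCCESS = 1;
    // Hypothetical metadata key under which the fetcher would record the status.
    static final String FETCH_STATUS_KEY = "example.fetch.status";

    // Decide whether the parse step should run for one Content record.
    static boolean shouldParse(Map<String, String> contentMetadata) {
        String raw = contentMetadata.get(FETCH_STATUS_KEY);
        if (raw == null) {
            // Assumption: no recorded status means we keep today's behaviour and parse.
            return true;
        }
        try {
            return Integer.parseInt(raw.trim()) == STATUS_FETCH_SUCCESS;
        } catch (NumberFormatException e) {
            // An unreadable status is treated as "not a successful fetch".
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> fetched = Collections.singletonMap(
                FETCH_STATUS_KEY, String.valueOf(STATUS_FETCH_SUCCESS));
        System.out.println(shouldParse(fetched));
    }
}
```

The check needs nothing but the Content record itself, which is why this option stays cheap: no second input has to be read.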

> don't store content for fetch with a crawldatum <> STATUS_FETCH_SUCCESS

We _do_ store content during fetch even if the status is not FETCH_SUCCESS, as long as there is something useful to store.

> load the crawldatum object in ParseSegment

We can't read the CrawlDatum object during map() (*). So during map operations we would still parse the content (even for non-FETCH_SUCCESS statuses) and output CrawlDatum-s; then, during reduce, we can read the CrawlDatum and store or discard the parse object accordingly.

I like this approach, but it brings extra overhead (reading crawl_fetch) and we still unnecessarily try to parse Content-s, only to discard them later.
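
The reduce-side decision described above could look roughly like the sketch below. It is plain Java rather than Hadoop or Nutch code: the Value, Datum and Parsed types are hypothetical stand-ins for the records that would arrive at reduce for a single URL, and the value of STATUS_FETCH_SUCCESS is a placeholder.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class ReduceSideFilter {
    // Placeholder standing in for CrawlDatum.STATUS_FETCH_SUCCESS (value is illustrative).
    static final int STATUS_FETCH_SUCCESS = 1;

    // Hypothetical tagged values grouped under one URL key at reduce time.
    interface Value {}

    static final class Datum implements Value {
        final int status;
        Datum(int status) { this.status = status; }
    }

    static final class Parsed implements Value {
        final String text;
        Parsed(String text) { this.text = text; }
    }

    // Keep the parse result only if a CrawlDatum with a success status is present.
    static Optional<String> reduce(List<Value> values) {
        Integer status = null;
        String parsed = null;
        for (Value v : values) {
            if (v instanceof Datum) {
                status = ((Datum) v).status;
            } else if (v instanceof Parsed) {
                parsed = ((Parsed) v).text;
            }
        }
        boolean fetchedOk = status != null && status == STATUS_FETCH_SUCCESS;
        return (fetchedOk && parsed != null) ? Optional.of(parsed) : Optional.empty();
    }

    public static void main(String[] args) {
        List<Value> values = Arrays.<Value>asList(
                new Datum(STATUS_FETCH_SUCCESS), new Parsed("parsed text"));
        System.out.println(reduce(values).isPresent());
    }
}
```

Note that the Parsed value already exists by the time reduce runs, which is exactly the wasted work mentioned above: the content was parsed in map and may be thrown away here.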

So I think the 3rd approach sounds better, but the 1st approach is simpler and more lightweight.

(*) We may try to create a reader that reads from content and crawl_fetch at the same time, but I don't think that crawl_fetch and content are necessarily in sync, so this will probably not work.

> ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-596
>                 URL: https://issues.apache.org/jira/browse/NUTCH-596
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Emmanuel Joke
>
> We have 2 choices for parsing the content: either within the Fetcher class or with the ParseSegment class.
> Fetcher (1 or 2) first checks whether the CrawlDatum status == STATUS_FETCH_SUCCESS and, if it is, parses the content.
> However, we don't have this check in ParseSegment, so we parse all content stored on disk without checking the status.
> So I think we should implement this check. I can see only 3 solutions:
> - read the status code in the Metadata of the Content object
> - don't store content for fetch with a crawldatum <> STATUS_FETCH_SUCCESS
> - load the crawldatum object in ParseSegment
> What are your thoughts?
