You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/10/01 14:21:33 UTC

[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ] 

Andrzej Bialecki  commented on NUTCH-907:
-----------------------------------------

Hi Sertan,

Thanks for the patch, this looks very good! A few  comments:

* I'm not good at naming things either... schemaId is a little bit cryptic though. If we didn't already use crawlId I would vote for that (and then rename crawlId to batchId or fetchId), as it is now... I dont know, maybe datasetId ..

* since we now create multiple datasets, we need somehow to manage them - i.e. list and delete at least (create is implicit). There is no such functionality in this patch, but this can be addressed also as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the "datasetId" as a Job property - this way indexing filter plugins can use this property to populate NutchDocument fields if needed. FWIW, this may be a good idea to do in other jobs as well...

> DataStore API doesn't support multiple storage areas for multiple disjoint crawls
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-907
>                 URL: https://issues.apache.org/jira/browse/NUTCH-907
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Andrzej Bialecki 
>             Fix For: 2.0
>
>         Attachments: NUTCH-907.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this "crawlId" value to select one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.