Posted to dev@nutch.apache.org by "Sertan Alkan (JIRA)" <ji...@apache.org> on 2010/10/01 16:11:33 UTC

[jira] Issue Comment Edited: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890 ] 

Sertan Alkan edited comment on NUTCH-907 at 10/1/10 10:10 AM:
--------------------------------------------------------------

Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} as the name, but now that you mention it, I prefer to use {{crawlId}} for this and rename the old {{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's going to be, but I don't think it will be much of a hassle to change both at once.
* I agree that arguments should override the configuration by actually setting the value in it, so that the setting is accessible elsewhere. I'll modify the patch to work this way.
* A utility to handle the datasets is a good idea, though considering the current Gora architecture, I think we may need to add a client interface there somewhere. I've opened an [issue|http://github.com/enis/gora/issues/issue/56] for this; we can start thinking about the design there. We won't be able to write a generic utility in Nutch, though, since this won't be available until we roll out a new version of Gora. I'll pitch in the utility once we have that, but as it doesn't affect this issue directly, I'd rather handle it in a separate issue. Until that issue is resolved, I think it would be safe to leave manipulation of stores (listing, removing, truncation, etc.) to the user.

I'll modify the patch to reflect those two changes.
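
A minimal sketch of the override behaviour from the second point: a crawl id given as an argument is written back into the configuration, so any code that reads the configuration later sees the same value. This is not the patch itself. The {{-crawlId}} flag, the {{storage.crawl.id}} property name and the {{CrawlIdOverride}} class are assumptions made for illustration, and a plain {{Map}} stands in for the real configuration object.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: an argument overrides the configured crawl id by being stored in the configuration. */
class CrawlIdOverride {

    // Hypothetical property name, used here purely for illustration.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    /**
     * Looks for "-crawlId <value>" (a hypothetical flag) in args. If found, the value is
     * written into conf so that later readers of conf see it. Returns the effective crawl id,
     * or an empty string when neither the arguments nor conf provide one.
     */
    static String applyCrawlId(Map<String, String> conf, String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-crawlId".equals(args[i])) {
                conf.put(CRAWL_ID_KEY, args[i + 1]);
                break;
            }
        }
        return conf.getOrDefault(CRAWL_ID_KEY, "");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(CRAWL_ID_KEY, "fromConfig");
        String effective = applyCrawlId(conf, new String[] {"-crawlId", "news"});
        // Both the return value and the stored setting now reflect the argument.
        System.out.println(effective + " / " + conf.get(CRAWL_ID_KEY));
    }
}
```

Writing the value into the configuration, rather than only passing it along as a parameter, is what keeps the setting accessible elsewhere.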

> DataStore API doesn't support multiple storage areas for multiple disjoint crawls
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-907
>                 URL: https://issues.apache.org/jira/browse/NUTCH-907
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Andrzej Bialecki 
>             Fix For: 2.0
>
>         Attachments: NUTCH-907.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this "crawlId" value to select one of possibly many existing crawl datasets.
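
One way to picture the prefix idea in the description above is as a table-name mapping. The sketch below is not the Gora API: the {{CrawlTableNames}} class, the underscore separator and the {{webpage}} schema name are all assumptions chosen for illustration.

```java
/** Sketch only: derive a per-crawl table name from a crawl id and a base schema name. */
class CrawlTableNames {

    /**
     * Returns the schema name prefixed with the crawl id and an underscore.
     * When no crawl id is given, the bare schema name is returned, which corresponds
     * to the current 1:1 mapping between a store and a single set of crawl data.
     */
    static String tableName(String crawlId, String schemaName) {
        if (crawlId == null || crawlId.isEmpty()) {
            return schemaName;
        }
        return crawlId + "_" + schemaName;
    }

    public static void main(String[] args) {
        // Two disjoint crawls share one storage backend but use different tables.
        System.out.println(tableName("news", "webpage"));
        System.out.println(tableName("blogs", "webpage"));
    }
}
```

With a mapping like this, disjoint crawls can live side by side in the same storage medium, much as different paths did in Nutch 1.x.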
