Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/09/15 17:00:34 UTC

[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

DataStore API doesn't support multiple storage areas for multiple disjoint crawls
---------------------------------------------------------------------------------

                 Key: NUTCH-907
                 URL: https://issues.apache.org/jira/browse/NUTCH-907
             Project: Nutch
          Issue Type: Bug
            Reporter: Andrzej Bialecki 
             Fix For: 2.0


In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc.) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs while still using the same storage medium, just under different paths.

This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data.

In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this "crawlId" value to select one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Sertan Alkan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890 ] 

Sertan Alkan commented on NUTCH-907:
------------------------------------

Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} for the name, but now that you mention it, I prefer to use {{crawlId}} for this and rename the old {{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's going to be, but I don't think it will be much of a hassle to change both at once.
* I agree that arguments should override the configuration by actually setting it, so that the setting is accessible elsewhere. I'll modify the patch to work this way.
* A utility to handle the datasets is a good idea, though considering the current Gora architecture I think we may need to add a client interface there somewhere. I've opened an [issue|http://github.com/enis/gora/issues/issue/56] for this; we can start thinking about the design there. We won't be able to write a generic utility in Nutch, though, since this won't be available until we roll out a new version of Gora. I'll pitch in the utility once we have that, but as it doesn't affect this issue directly, I'd rather handle it in a separate issue. Until that issue is solved, I think it would be safe to leave manipulation of stores (listing, removing, truncation, etc.) to the user.


[jira] Resolved: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-907.
-------------------------------------

    Resolution: Fixed

Committed in rev. 1025963. Thank you Sertan for a high-quality patch and unit tests!


[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910098#action_12910098 ] 

Doğacan Güney commented on NUTCH-907:
-------------------------------------

Gora already supports this somewhat. While creating a data store, you can optionally specify a table name:

  public static <D extends DataStore<K,T>, K, T extends Persistent>
  D createDataStore(Class<D> dataStoreClass, Class<K> keyClass,
      Class<T> persistent, String schemaName)

We should be able to leverage that in Nutch to support different crawl datasets. If we extend Nutch's current API to allow names to be specified for crawls, then Nutch can simply create tables prefixed with crawl names, as Andrzej suggested. For example, a crawl dataset with name "foo" will have a table called "foo_webtable".
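
A minimal sketch of the naming scheme described above, assuming the crawl name and the base table name are joined with an underscore as in the "foo_webtable" example. The class and method names (CrawlTableNames, tableFor) are hypothetical and are not part of Gora or Nutch; the returned string is the kind of value that could be passed as the schemaName argument of createDataStore.

```java
// Hypothetical helper for illustration only; not part of Gora or Nutch.
final class CrawlTableNames {
  private CrawlTableNames() {}

  /**
   * Returns "<crawlName>_<baseTable>" for a named crawl dataset,
   * or baseTable unchanged when no crawl name is given.
   */
  static String tableFor(String crawlName, String baseTable) {
    if (crawlName == null || crawlName.isEmpty()) {
      return baseTable;
    }
    return crawlName + "_" + baseTable;
  }

  public static void main(String[] args) {
    System.out.println(tableFor("foo", "webtable")); // prints "foo_webtable"
    System.out.println(tableFor(null, "webtable"));  // prints "webtable"
  }
}
```

Falling back to the unprefixed name when no crawl name is supplied keeps existing single-crawl setups working on the same table as before.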

What do you think, Andrzej? I think Gora needs no extension here, but if people think the API is awkward we can change Gora too.


[jira] Updated: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Sertan Alkan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sertan Alkan updated NUTCH-907:
-------------------------------

    Attachment: NUTCH-907.v2.patch

Here's the modified version of the patch after Andrzej's review. The changes relative to the original patch are as follows:

* The old {{crawlId}} option is renamed to {{batchId}} for convenience.

* All jobs now accept an optional argument, {{-crawlId <id>}}, to prefix the schema. Jobs now keep this property in the configuration, allowing later use by, say, plugins.
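
As a rough sketch of the second point, the following shows one way a job could copy a {{-crawlId <id>}} argument into its configuration so that later consumers read the same value. It is not the code from the patch: java.util.Properties stands in for the Hadoop Configuration, and the class name CrawlIdArgs and the key name "storage.crawl.id" are assumptions made for illustration.

```java
import java.util.Properties;

// Hypothetical sketch; java.util.Properties stands in for the job configuration.
final class CrawlIdArgs {
  // Assumed key name, chosen for illustration only.
  static final String CRAWL_ID_KEY = "storage.crawl.id";

  private CrawlIdArgs() {}

  /**
   * If args contains "-crawlId <id>", stores <id> in conf, replacing any
   * configured value. Otherwise conf is left untouched.
   */
  static void apply(String[] args, Properties conf) {
    for (int i = 0; i < args.length - 1; i++) {
      if ("-crawlId".equals(args[i])) {
        conf.setProperty(CRAWL_ID_KEY, args[i + 1]);
        return;
      }
    }
  }
}
```

Because the argument is written into the configuration rather than kept in a local variable, anything that later receives the configuration, such as a plugin, sees the effective crawl id.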

All unit tests pass, and I have again run a simple crawl without any problems. I have also tested the {{batchId}} option by generating two different sets of the injected URLs and running a fetch-parse cycle on those sets. Jobs seem to recognize the correct {{batchId}} and select only the corresponding URLs.

Like I said before, I prefer to leave the store manipulation utility out of this patch and handle it in a separate issue once we have that functionality in Gora. What do you think?




[jira] Issue Comment Edited: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Sertan Alkan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890 ] 

Sertan Alkan edited comment on NUTCH-907 at 10/1/10 10:10 AM:
--------------------------------------------------------------

Hi Andrzej,

Thanks for the review and the feedback.

* Funny thing, I was actually going for {{datasetId}} for the name, but now that you mention it, I prefer to use {{crawlId}} for this and rename the old {{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's going to be, but I don't think it will be much of a hassle to change both at once.
* I agree that arguments should override the configuration by actually setting it, so that the setting is accessible elsewhere. I'll modify the patch to work this way.
* A utility to handle the datasets is a good idea, though considering the current Gora architecture I think we may need to add a client interface there somewhere. I've opened an [issue|http://github.com/enis/gora/issues/issue/56] for this; we can start thinking about the design there. We won't be able to write a generic utility in Nutch, though, since this won't be available until we roll out a new version of Gora. I'll pitch in the utility once we have that, but as it doesn't affect this issue directly, I'd rather handle it in a separate issue. Until that issue is solved, I think it would be safe to leave manipulation of stores (listing, removing, truncation, etc.) to the user.

I'll modify the patch to reflect those two changes.


[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910109#action_12910109 ] 

Andrzej Bialecki  commented on NUTCH-907:
-----------------------------------------

That's very good news. In that case I'm fine with the Gora API as it is now; we should change Nutch to make use of this functionality.


[jira] Assigned: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  reassigned NUTCH-907:
---------------------------------------

    Assignee: Andrzej Bialecki 


[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Sertan Alkan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923418#action_12923418 ] 

Sertan Alkan commented on NUTCH-907:
------------------------------------

Thanks Andrzej, I've been waiting for this; I have a couple of use cases just for the functionality.


[jira] Updated: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Sertan Alkan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sertan Alkan updated NUTCH-907:
-------------------------------

    Attachment: NUTCH-907.patch

Here's a patch to allow Nutch to create different schemas based on the same schema definition. Some points about the patch:

* To be able to prefix a schema name with a value, Nutch needs to know the default schema name defined in the Gora mapping file (e.g. ...table=<name>...). Gora handles creation internally at the moment and doesn't expose this name to the outside. So the patch introduces two new configuration options to pass the schema name to Nutch internals.
** Nutch *ignores* the schema name setting in the Gora mapping file; instead, the configuration option {{storage.schema}} tells Nutch which schema name it should use to access the data store. This value defaults to _webpage_.
** The {{storage.schema.id}} option defines the prefix to add to the schema name in {{storage.schema}}. By default this id is not provided, i.e. all jobs will run on the _webpage_ store as before.
* Apart from the configuration option, all jobs (injector, generator, fetcher, updatedb, indexer, benchmark and webtable reader) are modified to accept a schema id as an optional command line argument, {{-schemaId}}, which overrides the configuration option ({{-schemaId}} may seem an odd name, but I am not big on naming things).
* The patch also modifies the unit tests to use the same logic.
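
The lookup described in the points above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: java.util.Properties stands in for the Hadoop Configuration, the class name SchemaNameResolver is hypothetical, and the underscore used to join the id and the schema name is an assumption, since the separator is not stated here. The option names and the _webpage_ default are the ones listed above.

```java
import java.util.Properties;

// Hypothetical sketch; java.util.Properties stands in for the job configuration.
final class SchemaNameResolver {
  static final String SCHEMA_KEY = "storage.schema";
  static final String SCHEMA_ID_KEY = "storage.schema.id";
  static final String DEFAULT_SCHEMA = "webpage";

  private SchemaNameResolver() {}

  /**
   * Resolves the schema name a job should use.
   *
   * @param conf        configuration holding the two storage options
   * @param schemaIdArg value of the -schemaId argument, or null if absent
   */
  static String resolve(Properties conf, String schemaIdArg) {
    String schema = conf.getProperty(SCHEMA_KEY, DEFAULT_SCHEMA);
    // A command line id, when given, takes precedence over the configured one.
    String id = (schemaIdArg != null) ? schemaIdArg : conf.getProperty(SCHEMA_ID_KEY);
    if (id == null || id.isEmpty()) {
      return schema; // no id: run on the plain store as before
    }
    return id + "_" + schema; // separator is an assumption
  }
}
```

With an empty configuration and no argument the result is the default store name, so existing crawls are unaffected unless an id is supplied.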

All unit tests pass without a problem, and I have run a simple crawl a) with the default configuration, b) with a schema id provided in the configuration and c) with the ids given on the command line; the jobs seem to run well.


[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ] 

Andrzej Bialecki  commented on NUTCH-907:
-----------------------------------------

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic though. If we didn't already use crawlId I would vote for that (and then rename crawlId to batchId or fetchId); as it is now... I don't know, maybe datasetId.

* Since we now create multiple datasets, we need some way to manage them, i.e. at least list and delete (create is implicit). There is no such functionality in this patch, but this can also be addressed as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the "datasetId" as a Job property; this way indexing filter plugins can use this property to populate NutchDocument fields if needed. FWIW, this may be a good idea to do in other jobs as well...
