Posted to dev@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2013/02/20 22:05:01 UTC

Configuration improvements to GeneratorJob

Hi,
Following on from a discussion on user@ [0], I dived into the GeneratorJob code
and have the following general comment based on my observations: usage of
configuration options is unstructured and loosely applied. This should not
be the case. For example:

Observations
===========

nutch-default.xml
---------------------
 - generate.max.count property appears here but I cannot see for the life
of me where it actually is used in the GeneratorJob, Mapper or Reducer.

Unused in GeneratorJob
--------------------------------
 - GENERATOR_MIN_SCORE - seems not to be used
 - GENERATOR_MAX_COUNT - seems not to be used

Missing in nutch-default.xml
------------------------------------
 - generate.min.score - but used in GeneratorJob
 - generate.filter - set to true by default and available as a CLI override
but should also be specified in nutch-default.xml
 - generate.normalise - set to true by default and available as a CLI
override but should also be specified in nutch-default.xml
 - generate.topN - set to 2^63-1 (Long.MAX_VALUE) by default and available as
a CLI override but should also be specified in nutch-default.xml
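
As a rough illustration of the three defaults listed above, here is a minimal sketch. It assumes a plain java.util.Properties object as a stand-in for the Hadoop Configuration that Nutch actually uses, and the class and method names are hypothetical. Only the property names and default values come from the list above.

```java
import java.util.Properties;

/** Hypothetical helper: resolves the generate.* options with the defaults listed above. */
final class GenerateDefaults {
  // Property names as they appear in this thread.
  static final String FILTER = "generate.filter";
  static final String NORMALISE = "generate.normalise";
  static final String TOP_N = "generate.topN";

  /** URL filtering is on unless the property says otherwise. */
  static boolean filter(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(FILTER, "true"));
  }

  /** URL normalisation is on unless the property says otherwise. */
  static boolean normalise(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty(NORMALISE, "true"));
  }

  /** 2^63-1 is Long.MAX_VALUE, which in practice means "no limit". */
  static long topN(Properties conf) {
    return Long.parseLong(conf.getProperty(TOP_N, String.valueOf(Long.MAX_VALUE)));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(filter(conf) + " " + normalise(conf) + " " + topN(conf));
    conf.setProperty(TOP_N, "1000"); // roughly what a CLI override would do
    System.out.println(topN(conf));
  }
}
```

Declaring the same defaults in nutch-default.xml would keep the file and the code in step, which is the point of the list above.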

Suggestions to add
--------------------------
 - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this static
element. I am not sure if it is used... I don't think it is.
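
For reference, the deprecation suggested above could look roughly like the sketch below. The class is a hypothetical stand-in for the constants in GeneratorJob, and the string value "ip" is an assumption (only 'host' and 'domain' are named in this thread).

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in for the count-mode constants in GeneratorJob. */
final class GeneratorConstants {
  static final String GENERATOR_COUNT_VALUE_HOST = "host";
  static final String GENERATOR_COUNT_VALUE_DOMAIN = "domain";

  /** @deprecated appears to be unused by the generator; value "ip" is assumed here. */
  @Deprecated
  static final String GENERATOR_COUNT_VALUE_IP = "ip";

  /** Reports whether the IP constant carries the @Deprecated annotation. */
  static boolean isIpValueDeprecated() {
    try {
      Field f = GeneratorConstants.class.getDeclaredField("GENERATOR_COUNT_VALUE_IP");
      return f.isAnnotationPresent(Deprecated.class);
    } catch (NoSuchFieldException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isIpValueDeprecated());
  }
}
```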

Any comments on this please?

[0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html

-- 
*Lewis*

Re: Configuration improvements to GeneratorJob

Posted by feng lu <am...@gmail.com>.
@Tejas +1

I think:

Keep Property
---------------------
-  generate.max.count: keep it because it is still used by GeneratorJob and GeneratorReducer.
-  GENERATOR_MAX_COUNT
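
To make the role of generate.max.count concrete, here is a rough sketch of a per-host cap of the kind the reducer is said to apply. It uses plain Java collections instead of a Hadoop reducer, the class and method names are hypothetical, and it assumes that a non-positive value means "no limit".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of capping selected URLs per host. */
final class MaxCountSketch {
  /** Keeps at most maxCount URLs per host; maxCount <= 0 is assumed to mean unlimited. */
  static List<String> select(List<String> urls, long maxCount) {
    Map<String, Long> perHost = new HashMap<>();
    List<String> selected = new ArrayList<>();
    for (String url : urls) {
      String host = URI.create(url).getHost();
      if (host == null) {
        host = ""; // URLs without a host share one bucket
      }
      long seen = perHost.merge(host, 1L, Long::sum);
      if (maxCount > 0 && seen > maxCount) {
        continue; // this host has already reached its limit
      }
      selected.add(url);
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String> urls = List.of(
        "http://a.example/1", "http://a.example/2",
        "http://a.example/3", "http://b.example/1");
    System.out.println(select(urls, 2));
  }
}
```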

Deprecate Property
------------------------------
- GENERATOR_MIN_SCORE
- GENERATOR_COUNT_VALUE_IP

Add in nutch-default.xml
-------------------------------------
- generate.min.score
- generate.filter
- generate.normalise
- generate.topN

Thanks
lufeng


On Mon, Feb 25, 2013 at 3:44 AM, Tejas Patil <te...@gmail.com>wrote:

> Hi Lewis,
>
> We have not come to a conclusion on this topic.
> Here is what I propose:
> 1. keep "generate.max.count"
> 2. GENERATOR_MIN_SCORE and GENERATOR_MAX_COUNT: once we know whether they
> were kept in 2.x for some valid reason, we can decide whether it is safe to
> remove these params. They seem to do nothing meaningful.
> 3. generate.min.score : remove ?
> 4. generate.filter, generate.normalise, generate.topN : there is no
> problem in keeping them. We could even remove them.
> 5. GENERATOR_COUNT_VALUE_IP : ??
>
> thanks,
> Tejas Patil
>
>
> On Wed, Feb 20, 2013 at 9:44 PM, Tejas Patil <te...@gmail.com>wrote:
>
>> Hi Lufeng,
>>
>> On Wed, Feb 20, 2013 at 9:19 PM, feng lu <am...@gmail.com> wrote:
>>
>>> Hi Tejas
>>>
>>> Yes, you are right. I misread the description of the property
>>> "generate.count.mode". Sorry about that. I also did not find any
>>> information about why the IP-based counting mode of "generate.count.mode"
>>> was disabled.
>>>
>>> Yes, I see that the FetchEntryPartitioner class (which is built on
>>> URLPartitioner) is used by FetcherJob. So, as you say, the setting of
>>> "partition.url.mode" has no effect on GeneratorJob.
>>>
>>> Do you think we could add a more detailed description to the property
>>> "generate.count.mode"? For example:
>>>
>>> <property>
>>>   <name>generate.count.mode</name>
>>>   <value>host</value>
>>>   <description>Determines how the URLs are counted for
>>>   generate.max.count. Default value is 'host' but can be 'domain'. Note
>>>   that we do not count per IP in the new version of the Generator. This
>>>   applies irrespective of the value of 'partition.url.mode' in
>>>   GeneratorJob.
>>>   </description>
>>> </property>
>>>
>> +1. This will help the users.
>>
>>> Sorry for my bad English.
>>>
>> That's fine. I am not perfect either :) There was a typo in my reply. I
>> missed a few words, or maybe they got deleted accidentally. Correction in
>> bold:
>> "There might be some reason behind removing it *and we must look into it*
>> before adding it back".
>>
>>>
>>> Thanks
>>> lufeng
>>>
>>> On Thu, Feb 21, 2013 at 12:14 PM, Tejas Patil <te...@gmail.com>wrote:
>>>
>>>> Hi Lufeng,
>>>>
>>>> On Wed, Feb 20, 2013 at 7:16 PM, feng lu <am...@gmail.com> wrote:
>>>>
>>>>> Hi Lewis
>>>>>
>>>>> Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.
>>>>>
>>>>> As for GENERATOR_COUNT_VALUE_IP, I think we could patch GeneratorJob
>>>>> instead of deprecating it. The patch might look like this:
>>>>>
>>>>> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_HOST);
>>>>> } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_DOMAIN);
>>>>> } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_IP);
>>>>> } else {
>>>>>   LOG.warn("Unknown generator.max.count mode '" + mode + "', using mode="
>>>>>       + GENERATOR_COUNT_VALUE_HOST);
>>>>>   getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>>>>>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>>>>>       URLPartitioner.PARTITION_MODE_HOST);
>>>>> }
>>>>>
>>>> The description of the property "generate.count.mode" says that IP-based
>>>> counting has been disabled in the newer Generator version. There might be
>>>> some reason behind removing it before adding it back. I am searching
>>>> for any relevant discussion(s) on user@ / dev@ or Jira about this. If
>>>> you find anything, do share.
>>>>
>>>>
>>>>
>>>>> If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will
>>>>> never be set, even if we set the partition.url.mode property to byIP in
>>>>> nutch-default.xml. Maybe the partition.url.mode property could then be
>>>>> removed from nutch-default.xml, because it depends on the value of
>>>>> GENERATOR_COUNT_MODE.
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> The url partitioning is done not only in generate phase, but fetch
>>>> phase too. The mode of the URLPartitioner is defined by the param
>>>> "partition.url.mode" which can be by host, domain or ip. This works out
>>>> well for fetch phase as it supports partitioning of urls in all these
>>>> modes. For generate phase, the mode of the URLPartitioner is governed by
>>>> the value of "generate.count.mode" (irrespective of the value of "partition.url.mode").
>>>> This "hack" is implemented in GeneratorJob [0] at lines 176-183.
>>>>
>>>> [0] :
>>>> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
>>>>
>>>>>
>>>>> Thanks,
>>>>> lufeng
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 21, 2013 at 10:26 AM, Tejas Patil <
>>>>> tejas.patil.cs@gmail.com> wrote:
>>>>>
>>>>>> Hey Lewis,
>>>>>>
>>>>>> On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
>>>>>> lewis.mcgibbney@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Following on from a discussion on user@ I dived into the
>>>>>>> GeneratorJob code and have the following general comment based on my
>>>>>>> observation... Usage of configuration options is really unstructured and
>>>>>>> loosely applied. This should not be the case. For example
>>>>>>>
>>>>>>> Observations
>>>>>>> ===========
>>>>>>>
>>>>>>> nutch-default.xml
>>>>>>> ---------------------
>>>>>>>  - generate.max.count property appears here but I cannot see for the
>>>>>>> life of me where it actually is used in the GeneratorJob, Mapper or Reducer.
>>>>>>>
>>>>>>
>>>>>> Not sure if you are talking in terms of usage of the value of the
>>>>>> param in the code logic or practical application of the param for some use
>>>>>> case.
>>>>>> The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT",
>>>>>> and later this is picked up by GeneratorReducer in its local variable
>>>>>> "maxcount", which is used in the reduce method. So I think it is being
>>>>>> used in the generate phase. To be honest, I have never faced a situation
>>>>>> where I had to use it, but I think it might be helpful for some class
>>>>>> of (rare) scenarios.
>>>>>>
>>>>>>>
>>>>>>> Unused in GeneratorJob
>>>>>>> --------------------------------
>>>>>>>  - GENERATOR_MIN_SCORE - seems not to be used
>>>>>>>  - GENERATOR_MAX_COUNT - seems not to be used
>>>>>>>
>>>>>>
>>>>>> You are right. These are used in 1.x but not in 2.x. Not sure if this
>>>>>> is something that was intentionally left out in 2.x or was missed
>>>>>> through an oversight. Do you have any idea?
>>>>>>
>>>>>>>
>>>>>>> Missing in nutch-default.xml
>>>>>>> ------------------------------------
>>>>>>>  - generate.min.score - but used in GeneratorJob
>>>>>>>
>>>>>> Well, as per the earlier point, GeneratorJob just reads this property
>>>>>> and stores it in a local variable. It is not used later by either map
>>>>>> or reduce for any processing.
>>>>>>
>>>>>>>  - generate.filter - set to true by default and available as a CLI
>>>>>>> override but should also be specified in nutch-default.xml
>>>>>>>  - generate.normalise - set to true by default and available as a
>>>>>>> CLI override but should also be specified in nutch-default.xml
>>>>>>>  - generate.topN - set to 2^63-1 by default and available as a CLI
>>>>>>> override but should also be specified in nutch-default.xml
>>>>>>>
>>>>>>> Suggestions to add
>>>>>>> --------------------------
>>>>>>>  - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this
>>>>>>> static element. I am not sure if it is used... I don't think it is.
>>>>>>>
>>>>>> It is not used. In my opinion, such things should be removed.
>>>>>> There was some discussion on the user list about removing such
>>>>>> deprecated properties from nutch-default.xml to avoid confusion (see [1]).
>>>>>> The corresponding Jira issue [2] was limited to the configs discussed in [1].
>>>>>> Maybe this discussion can be regarded as an extension/continuation of that
>>>>>> issue. What do you say?
>>>>>>
>>>>>>> Any comments on this please?
>>>>>>>
>>>>>>> [0]
>>>>>>> http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>>>
>>>>>>
>>>>>> [1] :
>>>>>> http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
>>>>>> [2] : https://issues.apache.org/jira/browse/NUTCH-1409
>>>>>>
>>>>>> Thanks,
>>>>>> Tejas Patil
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Don't Grow Old, Grow Up... :-)
>>>>>
>>>>
>>>> Thanks,
>>>> Tejas Patil
>>>>
>>>
>>>
>>>
>>> --
>>> Don't Grow Old, Grow Up... :-)
>>>
>>
>>
>


-- 
Don't Grow Old, Grow Up... :-)

Re: Configuration improvements to GeneratorJob

Posted by Tejas Patil <te...@gmail.com>.
Hi Lewis,

We have not come to a conclusion on this topic.
Here is what I propose:
1. keep "generate.max.count"
2. GENERATOR_MIN_SCORE and GENERATOR_MAX_COUNT: once we know whether they
were kept in 2.x for some valid reason, we can decide whether it is safe to
remove these params. They seem to do nothing meaningful.
3. generate.min.score : remove ?
4. generate.filter, generate.normalise, generate.topN : there is no
problem in keeping them. We could even remove them.
5. GENERATOR_COUNT_VALUE_IP : ??

thanks,
Tejas Patil



Re: Configuration improvements to GeneratorJob

Posted by Tejas Patil <te...@gmail.com>.
Hi Lufeng,

On Wed, Feb 20, 2013 at 9:19 PM, feng lu <am...@gmail.com> wrote:

> Hi Tejas
>
> Yes, you are right. I misread the description of the property
> "generate.count.mode". Sorry about that. I also did not find any
> information about why the IP-based counting mode of "generate.count.mode"
> was disabled.
>
> Yes, I see that the FetchEntryPartitioner class (which is built on
> URLPartitioner) is used by FetcherJob. So, as you say, the setting of
> "partition.url.mode" has no effect on GeneratorJob.
>
> Do you think we could add a more detailed description to the property
> "generate.count.mode"? For example:
>
> <property>
>   <name>generate.count.mode</name>
>   <value>host</value>
>   <description>Determines how the URLs are counted for generate.max.count.
>   Default value is 'host' but can be 'domain'. Note that we do not count
>   per IP in the new version of the Generator. This applies irrespective of
>   the value of 'partition.url.mode' in GeneratorJob.
>   </description>
> </property>
>
+1. This will help the users.

> Sorry for my bad English.
>
That's fine. I am not perfect either :) There was a typo in my reply. I
missed a few words, or maybe they got deleted accidentally. Correction in
bold:
"There might be some reason behind removing it *and we must look into it*
before adding it back".


Re: Configuration improvements to GeneratorJob

Posted by feng lu <am...@gmail.com>.
Hi Tejas

Yes, you are right. I misread the description of the property
"generate.count.mode". Sorry about that. I also did not find any
information about why the IP-based counting mode of "generate.count.mode"
was disabled.

Yes, I see that the FetchEntryPartitioner class (which is built on
URLPartitioner) is used by FetcherJob. So, as you say, the setting of
"partition.url.mode" has no effect on GeneratorJob.

Do you think we could add a more detailed description to the property
"generate.count.mode"? For example:

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generate.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator. This applies irrespective of
  the value of 'partition.url.mode' in GeneratorJob.
  </description>
</property>
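
The behaviour this description refers to can be sketched as follows. This is only an illustration, assuming java.util.Properties in place of the Hadoop Configuration and a hypothetical class name. The partition mode values "byHost" and "byDomain" are assumptions modelled on the "byIP" value mentioned in this thread.

```java
import java.util.Properties;

/** Hypothetical sketch: generate.count.mode decides the partition mode for the generate phase. */
final class CountModeSketch {
  static final String COUNT_MODE_KEY = "generate.count.mode";
  static final String PARTITION_MODE_KEY = "partition.url.mode";

  /**
   * Overwrites partition.url.mode based on generate.count.mode and returns the
   * value that was set. Anything other than 'domain' falls back to 'host'.
   */
  static String applyCountMode(Properties conf) {
    String mode = conf.getProperty(COUNT_MODE_KEY, "host");
    String partitionMode;
    if ("domain".equalsIgnoreCase(mode)) {
      partitionMode = "byDomain"; // assumed value
    } else {
      // 'host', or an unknown mode such as 'ip': count per host
      conf.setProperty(COUNT_MODE_KEY, "host");
      partitionMode = "byHost"; // assumed value
    }
    conf.setProperty(PARTITION_MODE_KEY, partitionMode);
    return partitionMode;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(PARTITION_MODE_KEY, "byIP");  // user setting
    conf.setProperty(COUNT_MODE_KEY, "domain");
    applyCountMode(conf);
    System.out.println(conf.getProperty(PARTITION_MODE_KEY)); // the user setting is overridden
  }
}
```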

Sorry for my bad English.

Thanks
lufeng

On Thu, Feb 21, 2013 at 12:14 PM, Tejas Patil <te...@gmail.com>wrote:

> Hi Lufeng,
>
> On Wed, Feb 20, 2013 at 7:16 PM, feng lu <am...@gmail.com> wrote:
>
>> Hi Lewis
>>
>> Sorry, I am wrong, The GeneratorJob is only used in Nutch 2.x not 1.x.
>>
>> To the property of GENERATOR_COUNT_VALUE_IP, i think we can add a patch
>> to GeneratorJob, instead of deprecated it. patch may like this.
>>
>> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>>       getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>> URLPartitioner.PARTITION_MODE_HOST);
>>     } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>>         getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>> URLPartitioner.PARTITION_MODE_DOMAIN);
>>     }
>>     else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>>         getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>> URLPartitioner.PARTITION_MODE_IP);
>>     }
>>     else {
>>       LOG.warn("Unknown generator.max.count mode '" + mode + "', using
>> mode=" + GENERATOR_COUNT_VALUE_HOST);
>>       getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>>       getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>> URLPartitioner.PARTITION_MODE_HOST);
>>     }
>>
>> The description of property "generate.count.mode" says the IP based
> counting has been disabled in the newer Generator version. There might be
> some reason behind removing it before adding it back. I am searching out
> for any relevant discussion(s) over @user / @dev  or Jira about this. If
> you find anything, do share.
>
>
>
>> if we deprecated it, the URLPartitioner mode PARTITION_MODE_IP will never
>> be setting even we set the partition.url.mode property to byIP in
>> nutch-default.xml. Maybe the partition.url.mode property will be removed in
>> nutch-default.xml. Because it's depends on the value of
>> GENERATOR_COUNT_MODE.
>>
>> How do your think please?
>>
>
> The url partitioning is done not only in generate phase, but fetch phase
> too. The mode of the URLPartitioner is defined by the param
> "partition.url.mode" which can be by host, domain or ip. This works out
> well for fetch phase as it supports partitioning of urls in all these
> modes. For generate phase, the mode of the URLPartitioner is governed by
> the value of "generate.count.mode" (irrespective of the value of "partition.url.mode").
> This "hack" is implemented in GeneratorJob [0] at lines 176-183.
>
> [0] :
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
>
>>
>> Thanks,
>> lufeng
>
> Thanks,
> Tejas Patil
>



-- 
Don't Grow Old, Grow Up... :-)

Re: Configuration improvements to GeneratorJob

Posted by Tejas Patil <te...@gmail.com>.
Hi Lufeng,

On Wed, Feb 20, 2013 at 7:16 PM, feng lu <am...@gmail.com> wrote:

> Hi Lewis
>
> Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.
>
> For the GENERATOR_COUNT_VALUE_IP constant, I think we can add a patch to
> GeneratorJob instead of deprecating it. The patch might look like this:
>
> if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>       URLPartitioner.PARTITION_MODE_HOST);
> } else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>       URLPartitioner.PARTITION_MODE_DOMAIN);
> } else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>       URLPartitioner.PARTITION_MODE_IP);
> } else {
>   LOG.warn("Unknown generator.max.count mode '" + mode
>       + "', using mode=" + GENERATOR_COUNT_VALUE_HOST);
>   getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
>   getConf().set(URLPartitioner.PARTITION_MODE_KEY,
>       URLPartitioner.PARTITION_MODE_HOST);
> }
>

The description of the property "generate.count.mode" says that IP-based
counting has been disabled in the newer Generator version. There might have
been some reason for removing it, so we should look into that before adding
it back. I am searching for any relevant discussion(s) on user@ / dev@ or
Jira about this. If you find anything, do share.



> If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will never
> be set, even if we set the partition.url.mode property to byIP in
> nutch-default.xml. Maybe the partition.url.mode property could then be
> removed from nutch-default.xml, because it depends on the value of
> GENERATOR_COUNT_MODE.
>
> What do you think?
>

URL partitioning is done not only in the generate phase but in the fetch
phase too. The mode of the URLPartitioner is defined by the param
"partition.url.mode", which can be by host, domain or IP. This works out
well for the fetch phase, as it supports partitioning of URLs in all these
modes. For the generate phase, the mode of the URLPartitioner is governed by
the value of "generate.count.mode" (irrespective of the value of
"partition.url.mode").
This "hack" is implemented in GeneratorJob [0] at lines 176-183.

[0] :
http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/crawl/GeneratorJob.java?view=markup
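
The override described above can be sketched roughly as follows. This is a
standalone illustration, not the GeneratorJob code: the class and method
names are hypothetical, a plain Map stands in for the Hadoop Configuration,
and the "byHost" / "byDomain" strings are assumed to mirror the "byIP" value
mentioned in this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of which partition mode ends up being used. */
class PartitionModeSketch {

  static final String PARTITION_MODE_KEY = "partition.url.mode";
  static final String COUNT_MODE_KEY = "generate.count.mode";

  /**
   * Fetch phase: partition.url.mode is honoured as configured.
   * Generate phase: generate.count.mode decides, whatever
   * partition.url.mode says.
   */
  static String effectiveMode(Map<String, String> conf, boolean generatePhase) {
    if (!generatePhase) {
      // Assumed default of "byHost" when the property is absent.
      return conf.getOrDefault(PARTITION_MODE_KEY, "byHost");
    }
    String mode = conf.getOrDefault(COUNT_MODE_KEY, "host");
    if ("domain".equalsIgnoreCase(mode)) {
      return "byDomain";
    }
    // "host" and any other value (including "ip") fall back to host.
    return "byHost";
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(PARTITION_MODE_KEY, "byIP");
    conf.put(COUNT_MODE_KEY, "domain");
    System.out.println("generate: " + effectiveMode(conf, true));
    System.out.println("fetch:    " + effectiveMode(conf, false));
  }
}
```

With partition.url.mode set to byIP, only the fetch-phase call in this sketch
returns byIP, which is the behaviour the "hack" produces.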

>
> Thanks,
> lufeng
>

Thanks,
Tejas Patil

Re: Configuration improvements to GeneratorJob

Posted by feng lu <am...@gmail.com>.
Hi Lewis

Sorry, I was wrong. The GeneratorJob is only used in Nutch 2.x, not 1.x.

For the GENERATOR_COUNT_VALUE_IP constant, I think we can add a patch to
GeneratorJob instead of deprecating it. The patch might look like this:

if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) {
  getConf().set(URLPartitioner.PARTITION_MODE_KEY,
      URLPartitioner.PARTITION_MODE_HOST);
} else if (GENERATOR_COUNT_VALUE_DOMAIN.equalsIgnoreCase(mode)) {
  getConf().set(URLPartitioner.PARTITION_MODE_KEY,
      URLPartitioner.PARTITION_MODE_DOMAIN);
} else if (GENERATOR_COUNT_VALUE_IP.equalsIgnoreCase(mode)) {
  getConf().set(URLPartitioner.PARTITION_MODE_KEY,
      URLPartitioner.PARTITION_MODE_IP);
} else {
  LOG.warn("Unknown generator.max.count mode '" + mode
      + "', using mode=" + GENERATOR_COUNT_VALUE_HOST);
  getConf().set(GENERATOR_COUNT_MODE, GENERATOR_COUNT_VALUE_HOST);
  getConf().set(URLPartitioner.PARTITION_MODE_KEY,
      URLPartitioner.PARTITION_MODE_HOST);
}

If we deprecate it, the URLPartitioner mode PARTITION_MODE_IP will never
be set, even if we set the partition.url.mode property to byIP in
nutch-default.xml. Maybe the partition.url.mode property could then be
removed from nutch-default.xml, because it depends on the value of
GENERATOR_COUNT_MODE.

What do you think?

Thanks,
lufeng






-- 
Don't Grow Old, Grow Up... :-)

Re: Configuration improvements to GeneratorJob

Posted by Tejas Patil <te...@gmail.com>.
Hey Lewis,

On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi,
> Following on from a discussion on user@ I dived into the GeneratorJob
> code and have the following general comment based on my observation...
> Usage of configuration options is really unstructured and loosely applied.
> This should not be the case. For example
>
> Observations
> ===========
>
> nutch-default.xml
> ---------------------
>  - generate.max.count property appears here but I cannot see for the life
> of me where it actually is used in the GeneratorJob, Mapper or Reducer.
>

Not sure whether you mean usage of the param's value in the code logic or
practical application of the param for some use case.
The GeneratorJob stores "generate.max.count" as "GENERATOR_MAX_COUNT", and
this is later picked up by GeneratorReducer in its local variable
"maxcount", which is used in the reduce method. So I think it is being used
in the generate phase. To be honest, I have never faced a situation where I
had to use it, but I think it might be helpful for some class of (rare)
scenarios.
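
To make the flow described above concrete, here is a minimal standalone
sketch of a per-host cap like the one "maxcount" applies in the reduce step.
It is not the actual GeneratorReducer: the class and method names are
hypothetical, plain Java collections stand in for the Hadoop types, and
treating a negative value as "no limit" is an assumption.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a generate.max.count style cap. */
class MaxCountSketch {

  /**
   * Keeps at most maxCount URLs per host, preserving input order.
   * A negative maxCount is treated as "no limit" (assumption).
   */
  static List<String> limitPerHost(List<String> urls, long maxCount) {
    Map<String, Long> seenPerHost = new HashMap<>();
    List<String> selected = new ArrayList<>();
    for (String url : urls) {
      String host = URI.create(url).getHost();
      if (host == null) {
        continue; // skip URLs that have no host component
      }
      long count = seenPerHost.merge(host, 1L, Long::sum);
      if (maxCount < 0 || count <= maxCount) {
        selected.add(url);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://a.example/1", "http://a.example/2",
        "http://a.example/3", "http://b.example/1");
    // With a cap of 2, the third a.example URL is dropped.
    System.out.println(limitPerHost(urls, 2));
  }
}
```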

>
> Unused in GeneratorJob
> --------------------------------
>  - GENERATOR_MIN_SCORE - seems not to be used
>  - GENERATOR_MAX_COUNT - seems not to be used
>

You are right. These are used in 1.x but not in 2.x. Not sure if this is
something that was intentionally left out of 2.x or was simply overlooked.
Do you have any idea?

>
> Missing in nutch-default.xml
> ------------------------------------
>  - generate.min.score - but used in GeneratorJob
>
Well, as per the earlier point, GeneratorJob just picks up this property and
stores it in a local variable. It is not used later by either map or reduce
for any processing.

>  - generate.filter - set to true by default and available as a CLI override
> but should also be specified in nutch-default.xml
>  - generate.normalise - set to true by default and available as a CLI
> override but should also be specified in nutch-default.xml
>  - generate.topN - set to 2^63-1 by default and available as a CLI
> override but should also be specified in nutch-default.xml
>
> Suggestions to add
> --------------------------
>  - GENERATOR_COUNT_VALUE_IP - We should add a @Deprecated on this static
> element. I am not sure if it is used... I don't think it is.
>

It is not used. In my opinion, I would favor removal of such things. There
was some discussion on the user list about removing such deprecated
properties from nutch-default.xml to avoid confusion (see [1]). The
corresponding Jira issue [2] was limited to the configs discussed in [1].
Maybe this discussion can be regarded as an extension/continuation of that
Jira issue. What say?

> Any comments on this please?
>
> [0] http://www.mail-archive.com/user%40nutch.apache.org/msg08854.html



>
>
> --
> *Lewis*
>

[1] :
http://lucene.472066.n3.nabble.com/generate-max-count-was-not-affected-td4031013.html
[2] : https://issues.apache.org/jira/browse/NUTCH-1409

Thanks,
Tejas Patil

Re: Configuration improvements to GeneratorJob

Posted by feng lu <am...@gmail.com>.
Hi Lewis

I think generate.max.count is used by anyone who wants to limit the number
of URLs per domain (host). See
http://wiki.apache.org/nutch/Nutch2Crawling#Reducer

The generate.min.score property is already defined in nutch-default.xml.

The generate.(filter|normalise|topN) options can be passed on the Generator
command line, so I think that is a little more flexible than defining them
in nutch-default.xml.
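
As an illustration of how a value in nutch-default.xml and a CLI override
could coexist, here is a minimal sketch of the precedence: built-in default,
then configuration value, then command-line flag. The class is hypothetical,
java.util.Properties stands in for the Nutch configuration, and the flag
names (-topN, -noFilter, -noNorm) are assumptions for illustration.

```java
import java.util.Properties;

/** Hypothetical holder for the three generate options discussed here. */
class GenerateOptionsSketch {

  final long topN;
  final boolean filter;
  final boolean normalise;

  GenerateOptionsSketch(long topN, boolean filter, boolean normalise) {
    this.topN = topN;
    this.filter = filter;
    this.normalise = normalise;
  }

  /** Resolves each option as: code default, then config, then CLI flag. */
  static GenerateOptionsSketch resolve(Properties conf, String[] args) {
    // Code defaults apply only when the config does not define the key.
    long topN = Long.parseLong(
        conf.getProperty("generate.topN", String.valueOf(Long.MAX_VALUE)));
    boolean filter =
        Boolean.parseBoolean(conf.getProperty("generate.filter", "true"));
    boolean normalise =
        Boolean.parseBoolean(conf.getProperty("generate.normalise", "true"));

    // Command-line flags (assumed names) override whatever the config said.
    for (int i = 0; i < args.length; i++) {
      if ("-topN".equals(args[i]) && i + 1 < args.length) {
        topN = Long.parseLong(args[++i]);
      } else if ("-noFilter".equals(args[i])) {
        filter = false;
      } else if ("-noNorm".equals(args[i])) {
        normalise = false;
      }
    }
    return new GenerateOptionsSketch(topN, filter, normalise);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("generate.topN", "500");
    GenerateOptionsSketch opts = resolve(conf, new String[] {"-noFilter"});
    System.out.println(opts.topN + " " + opts.filter + " " + opts.normalise);
  }
}
```

Under this scheme the command line stays just as flexible, while the
defaults become visible in one place.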

I see this description of the generate.count.mode property in nutch-default.xml:

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generator.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  </description>
</property>

Maybe the GENERATOR_COUNT_VALUE_IP mode will be added in the next Generator
version.






-- 
Don't Grow Old, Grow Up... :-)