Posted to user@nutch.apache.org by Fabio Ricci <fa...@gmail.com> on 2017/04/10 23:12:24 UTC

Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Hello

I am a newbie to NUTCH and I need a crawler to fetch URLs within a given depth and to index the found pages into SOLR 6.5.

On my OS X machine I got NUTCH running. I was hoping to use it directly for indexing. Instead I am wondering why the script /runtime/local/bin/crawl does not pass the depth and topN parameters to the software.

Specifically, I use the following example call:

./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ -D depth=2 -D topN=2 /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1

With a single URL inside /urls/seed.txt

I expected the crawling process to go to max depth = 2.

Instead, it runs and runs … and I suppose something runs ***differently*** than described.

For example, I noticed the following text in the output (this is just an excerpt; the output "does not stop"):

Injecting seed URLs
/Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch inject /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/urls/
Injector: starting at 2017-04-11 00:54:56
Injector: crawlDb: /Users/fabio/NUTCH/crawl/crawldb
Injector: urlDir: /Users/fabio/NUTCH/urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0
Injector: finished at 2017-04-11 00:54:58, elapsed: 00:00:01
Tue Apr 11 00:54:58 CEST 2017 : Iteration 1 of 1
Generating a new segment
/Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true /Users/fabio/NUTCH/crawl//crawldb /Users/fabio/NUTCH/crawl//segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2017-04-11 00:54:59
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: /Users/fabio/NUTCH/crawl/segments/20170411005501
Generator: finished at 2017-04-11 00:55:02, elapsed: 00:00:03
Operating on segment : 20170411005501
Fetching : 20170411005501
/Users/fabio/Documents/workspace/NUTCH/runtime/local/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 /Users/fabio/NUTCH/crawl//segments/20170411005501 -noParsing -threads 50

Here - although I am a newbie - I notice that there is one line saying “Generator: topN: 50000” - slightly more than -D topN=2 … and there is no indication of the depth. So this nice script /bin/crawl seems not to pass the -D parameters to the Java application. And maybe not even the solr.server.url value …

Googling for “depth” finds a lot of explanations of the deprecated form /bin/nutch crawl -depth … etc., so I feel a little confused and need help.

What is wrong with my call example above please? 

Thank you for any hint which can help me understand why the -D parameters are not passed.

Regards
Fabio Ricci


Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
> 1) Why is nutch performing a “never ending output” (a part of it is attached) fetching hundreds of
> urls

The only explanation is that these URLs were already in
  /Users/fabio/NUTCH/crawl/crawldb/

- only one URL is injected (but it was already in CrawlDb):

Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0

- then 203 URLs are fetched:

QueueFeeder finished: total 203 records + hit by time limit :0


Just delete the crawl folder to start a crawl from scratch.
Otherwise the previous crawl is continued (one cycle added).


> 2) If -D Java parameters are only passed to hadoop (why), *is this still the right way to
>    integrate nutch productively with SOLR* as the documentation says ?

Good question what the *right way* is.  You could also set the property in nutch-site.xml.
Properties are set and overwritten along the hierarchy
  nutch-default.xml  <  nutch-site.xml  <  command-line -D ...
You have the choice. :)
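
As an illustration, a minimal nutch-site.xml sketch for that property (the value is just the one from your example call; adjust it to your own core):

<property>
  <name>solr.server.url</name>
  <value>http://127.0.0.1:8983/solr/demo/</value>
  <description>Base URL of the Solr core used by the indexer-solr plugin
  (value taken from the example call in this thread).</description>
</property>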




Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Posted by Fabio Ricci <fa...@gmail.com>.
Hi Sebastian

Maybe a second consideration:

if the latest crawl parameter “numRounds” is 1, like in my original command (without the depth and topN parameters)
./bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr/demo/ /Users/fabio/NUTCH/urls/ /Users/fabio/NUTCH/crawl/ 1

There are a couple of questions which arise - considering only one seed URL: https://rdflink.ch and 1 as depth and 1 as scoring.depth.max in the properties, as kindly suggested to me!!!

1) Why is nutch performing a “never ending output” (a part of it is attached), fetching hundreds of URLs (in this case just one shot is expected: one web site and stop - a very quick story actually)? If I grep http index.php | wc -l (index.php is the only top page behind that domain), I get 38 lines, so at most 38 (unduplicated) URLs - despite this, nutch is apparently fetching hundreds of URLs … I really do not understand this and it seems I cannot control it; it seems nutch starts and does “what it will” - I attached a zipped output here to show what I mean.


2) If -D Java parameters are only passed to hadoop (why?), is this still the right way to integrate nutch productively with SOLR, as the documentation says?

Thanks a lot again
Regards
Fabio




Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
> It seems it is a science, not a tool :)))

... grown over time (since 2002) to cover many use cases, always with scale (millions, billions) in mind.

> How exactly should “_maxdepth_=2” as seed metadata be specified (and where)

In the seed URLs file, separated by a tab character (U+0009):

http://example.com/	_maxdepth_=2
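
Applied to the seed URL from this thread it would look like the line below (only an illustration; the whitespace before _maxdepth_ must be a literal tab):

https://rdflink.ch/	_maxdepth_=2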

Best,
Sebastian




Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Posted by Fabio Ricci <fa...@gmail.com>.
Dear Sebastian and Ben - thank you so far for your hints!
It seems it is a science, not a tool :)))
I was simply imagining a graph (built upon URLs) which is built (= “injected” into the nutch universe) and explored within a radius (a depth).

Instead, and surely because of other considerations (mass crawling aspects), there seem to be other control parameters which rather “approximate” this simple concept …

Anyway, in nutch 1.13 /conf/nutch-site.xml - thanks to your kind hint - there is a section like:

<property>
  <name>scoring.depth.max</name>
  <value>1000</value>
  <description>Max depth value from seed allowed by default.
  Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
  as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
  to track the distance from the seed it was found from. 
  The depth is used to prioritise URLs in the generation step so that
  shallower pages are fetched first.
  </description>
</property>

Considering https://wiki.apache.org/nutch/NutchTutorial#Create_a_URL_seed_list now the question is:
How exactly should “_maxdepth_=2” be specified as seed metadata (and where), so that the “depth” can be set for (or before) each NUTCH run (instead of being changed in the properties)?

Best
Fabio





Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

"generate.max.distance" is for Nutch 2.x, for Nutch 1.x there is the plugin scoring-depth:
- add it to the property "plugin.includes"
- configure the property "scoring.depth.max"
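
As a sketch, the two settings together in conf/nutch-site.xml could look like this (the plugin.includes value is only an illustration of appending scoring-depth to whatever list you already use, and the max depth of 2 matches the example from this thread):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
  <description>Existing plugin list with scoring-depth appended (illustrative value).</description>
</property>

<property>
  <name>scoring.depth.max</name>
  <value>2</value>
  <description>Maximum link depth from the seeds (illustrative value; can be
  overridden per seed via _maxdepth_ metadata).</description>
</property>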

But depth and cycles/rounds are equivalent if topN is large. During the first cycle all seeds (depth
1) are fetched, the second cycle fetches all links of depth 2, and so on. Only if there are more
URLs to fetch than topN, you get a different behavior for depth and cycles.

>> Maybe I should use a lower NUTCH version (which) ?
1.13 is a good choice.

Best,
Sebastian



Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Posted by Ben Vachon <bv...@attivio.com>.
Hi Fabio,

I believe there is a property generate.max.distance in nutch-site.xml in 
the newest releases that you can use to configure max depth.




Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Posted by Fabio Ricci <fa...@gmail.com>.
Hi Sebastian

thank you for your message. That does not really help me, though…

Yes, I knew the output of ./crawl without parameters (the synopsis) - but even then there are some assumptions only an insider can understand. And under https://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search there is an indication to use it like I tried.

Num Rounds is not a Depth. A Depth is the depth in traversing links starting from the seed.
I admit I feel overwhelmed by all these parameters, which in my case do not help me…

I just need a tool which navigates from a seed URL within a certain depth. I do not need topN parameters …

Maybe I should use a lower NUTCH version (which) ?

...

Thanks
Fabio




Re: Nutch 1.13 @Sierra - Java -D parameters not passed to nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Fabio,

only Java/Hadoop properties can be passed via -D...

Command-line parameters (such as -topN) cannot be passed to Nutch tools/steps this way, see:

% bin/crawl
Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>
        -i|--index      Indexes crawl results into a configured indexer
        -D              A Java property to pass to Nutch calls
        -w|--wait       NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
                        are scheduled for fetching. Suffix can be: s for second,
                        m for minute, h for hour and d for day. If no suffix is
                        specified second is used by default.
        Seed Dir        Directory in which to look for a seeds file
        Crawl Dir       Directory where the crawl/link/segments dirs are saved
        Num Rounds      The number of rounds to run this crawl for

In case of -topN : you need to modify bin/crawl (that's easy to do). There are also other ways to
limit the length of the fetch list (see, e.g., "generate.max.count").
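
For the latter, an example nutch-site.xml entry (the value of 100 is only an illustration):

<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Upper limit applied by the Generator when building a fetch list;
  -1 means unlimited (example value, see nutch-default.xml for the full description).</description>
</property>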

Regarding -depth : I suppose that's the same as <Num Rounds>

Best,
Sebastian
