Posted to dev@nutch.apache.org by "Rod Taylor (JIRA)" <ji...@apache.org> on 2006/01/12 00:27:19 UTC

[jira] Created: (NUTCH-171) Bring back multiple segment support for Generate / Update

Bring back multiple segment support for Generate / Update
---------------------------------------------------------

         Key: NUTCH-171
         URL: http://issues.apache.org/jira/browse/NUTCH-171
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev    
    Reporter: Rod Taylor
    Priority: Minor
 Attachments: multi_segment.patch

We find it convenient to be able to run generate once for -topN 300M and have multiple independent segments to work with (lower overhead) -- then run update on all segments which succeeded simultaneously.

This reactivates -numFetchers and fixes updatedb to handle multiple provided segments again.



Radu Mateescu wrote the attached patch for us with the below description (lightly edited):

The implementation of -numFetchers in 0.8 improperly plays with the number of reduce tasks in order to generate a given number of fetch lists. Basically, what it does is this: before the second reduce (map-reduce is applied twice for generate), it sets the number of reduce tasks to numFetchers. Ideally, because each reduce will create a file like part-00000, part-00001, etc. in the ndfs, we'll end up with the desired number of fetch lists. But this behaviour is incorrect for the following reasons:
1. The number of reduce tasks is orthogonal to the number of segments somebody wants to create. The number of reduce tasks should be chosen based on the physical topology rather than the number of segments someone might want in ndfs.
2. If you specify a value for the mapred.reduce.tasks property in nutch-site.xml, numFetchers seems to be ignored.
 
Therefore, I changed this behaviour to work like this:
 - generate will create numFetchers segments
 - each reduce task will write to all segments (assuming there are enough values to be written) in a round-robin fashion
The end result for 3 reduce tasks and 2 segments will look like this:
 
/opt/nutch/bin>./nutch ndfs -ls segments
060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
060111 122228 Client connection to 192.168.0.1:5466: starting
060111 122228 No FS indicated, using default:master:5466
Found 2 items
/user/root/segments/20060111122144-0    <dir>
/user/root/segments/20060111122144-1    <dir>

 
/opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
060111 122318 No FS indicated, using default:master:5466
060111 122318 Client connection to 192.168.0.1:5466: starting
Found 3 items
/user/root/segments/20060111122144-0/crawl_generate/part-00000  1276
/user/root/segments/20060111122144-0/crawl_generate/part-00001  1289
/user/root/segments/20060111122144-0/crawl_generate/part-00002  1858

 
/opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
060111 122334 Client connection to 192.168.0.1:5466: starting
060111 122334 No FS indicated, using default:master:5466
Found 3 items
/user/root/segments/20060111122144-1/crawl_generate/part-00000  1207
/user/root/segments/20060111122144-1/crawl_generate/part-00001  1236
/user/root/segments/20060111122144-1/crawl_generate/part-00002  1841
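
The round-robin scheme described above can be sketched roughly as follows. This is only an illustration in Python, not the code from multi_segment.patch: the function name, the example URLs, and the in-memory lists standing in for part files are all assumptions made for the example.

```python
from collections import defaultdict


def distribute_round_robin(reduce_outputs, num_segments):
    """Spread each reduce task's entries over all segments in turn.

    reduce_outputs: one list of URLs per reduce task (hypothetical input).
    Returns {segment_index: {part_name: [urls]}}.
    """
    segments = {s: defaultdict(list) for s in range(num_segments)}
    for task_id, urls in enumerate(reduce_outputs):
        part = f"part-{task_id:05d}"  # one part file per reduce task
        for i, url in enumerate(urls):
            # The i-th value of a task goes to segment i mod num_segments.
            segments[i % num_segments][part].append(url)
    return {s: dict(parts) for s, parts in segments.items()}


# 3 reduce tasks and 2 segments, as in the listing above (made-up URLs).
reduce_outputs = [
    [f"http://example.com/{task}/{i}" for i in range(5)] for task in range(3)
]
layout = distribute_round_robin(reduce_outputs, num_segments=2)
for seg, parts in sorted(layout.items()):
    for part, urls in sorted(parts.items()):
        print(f"segment-{seg}/crawl_generate/{part}: {len(urls)} urls")
```

Every segment ends up with one part file per reduce task, so the number of segments no longer depends on the number of reduce tasks.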



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

Posted by "Rod Taylor (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372588 ] 

Rod Taylor commented on NUTCH-171:
----------------------------------

"One thing that's needed is the ability to mark urls as "being fetched", which was in 0.7 but has not yet made it into 0.8. In addition, we need to be able to prioritize jobs."

Agreed. Ideally I could specify a maximum of X fetch map tasks to be executed simultaneously.

This would allow other work to happen in the background and along with the bandwidth limiter patch (per task) it would allow a specific amount of bandwidth to be used.



"Ideally crawling should work something like:
1. generate segment 1
2. start fetching segment 1
3. generate segment 2;
4. wait for segment 1 fetch to complete
5. start fetching segment 2;
6. update db with output from fetch 1
7. generate segment 3;
8. wait for segment 2 fetch to complete"

This could work, but with 1 billion URLs in the database, generate and update both take a significant amount of time. I would hate to see what it will be like with more than that.

Generate for 20 segments of 10M in size is almost as fast as for 1 segment that is 10M in size. A single 200M URL segment is unwieldy from an error management perspective. I actually prefer 1M URL segments.


Ditto for updatedb. Updating 20 segments of 10M URLs in size is pretty much as fast as dealing with a single 10M segment.


Ideally, in my eyes:

1) Generate a batch of segments (a few days worth of fetching) -- Xa
2) Fetch Xa/2 segments (literally run 2 of these at once -- have Hadoop limit number of simultaneous MAP jobs)
3) UpdateDB for Xa/2 segments
4) Generate a new batch of segments -- Xb
5) Fetch Xa/2 (second half of first set) and Xb/2 (first half of second set)
6) Single UpdateDB for segments Xa/2 and Xb/2
7) All of Xa have been completed. Complete the job on these (merge into 1 unit and index or whatever else needs to be done)
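
The staggered schedule in the list above can be written out as a small sketch. It is an illustration only: the function name and the step strings are hypothetical, and nothing here calls Nutch or Hadoop.

```python
def pipeline_steps(batches):
    """Return the ordered steps for the staggered schedule described above.

    Each batch is fetched in two halves; the second half of one batch is
    fetched together with the first half of the next batch.
    """
    steps = []
    pending = None  # batch whose second half has not been fetched yet
    for name in batches:
        steps.append(f"generate {name}")
        halves = []
        if pending is not None:
            halves.append(f"{pending}/2 (second half)")
        halves.append(f"{name}/2 (first half)")
        steps.append("fetch " + " and ".join(halves))
        steps.append("updatedb " + " and ".join(halves))
        if pending is not None:
            # Both halves of the previous batch are now fetched and updated.
            steps.append(f"finish {pending} (merge, index, etc.)")
        pending = name
    return steps


steps = pipeline_steps(["Xa", "Xb"])
for number, step in enumerate(steps, start=1):
    print(f"{number}) {step}")
```

For the two batches Xa and Xb this yields the same seven steps as the list, and a longer list of batches continues the pattern.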

Hadoop should also be given the ability to limit the number of jobs of a single type: MapFetch -> X, ReduceFetch -> Y, MapGenerate -> Z, etc. AND give a priority based on job type. MapFetch is more important than ReduceFetch, which is more important than pretty much anything else.
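
A toy model of the per-type limits and priorities requested above might look like the following. This is not a Hadoop API; the limit values, priority numbers, and function name are assumptions chosen only to show the idea.

```python
import heapq

# Hypothetical settings: lower number means higher priority.
PRIORITY = {"MapFetch": 0, "ReduceFetch": 1, "MapGenerate": 2}
LIMITS = {"MapFetch": 2, "ReduceFetch": 1, "MapGenerate": 1}


def schedule(pending, running, free_slots):
    """Choose which pending tasks to start.

    pending: list of task type names waiting to run.
    running: {task_type: count} of tasks already running.
    free_slots: number of task slots currently available.
    """
    counts = dict(running)
    # The index breaks ties so that equal-priority tasks stay in arrival order.
    queue = [(PRIORITY[t], i, t) for i, t in enumerate(pending)]
    heapq.heapify(queue)
    started = []
    while queue and len(started) < free_slots:
        _, _, task_type = heapq.heappop(queue)
        if counts.get(task_type, 0) < LIMITS[task_type]:
            counts[task_type] = counts.get(task_type, 0) + 1
            started.append(task_type)
        # Tasks over their type limit are skipped and stay pending.
    return started


waiting = ["MapGenerate", "MapFetch", "MapFetch", "MapFetch", "ReduceFetch"]
print(schedule(waiting, running={}, free_slots=4))
```

With these settings the third MapFetch is held back by its limit, and the lower-priority ReduceFetch and MapGenerate tasks take the remaining slots.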


[jira] Updated: (NUTCH-171) Bring back multiple segment support for Generate / Update

Posted by "Rod Taylor (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-171?page=all ]

Rod Taylor updated NUTCH-171:
-----------------------------

    Attachment: multi_segment.patch

Perhaps -numFetchers should be renamed to -numSegments ?

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

Posted by "Rod Taylor (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372602 ] 

Rod Taylor commented on NUTCH-171:
----------------------------------

> How is a 200M url segment unwieldy?

I have found this for two reasons. First, Nutch still has a bad habit of not completing a segment once in a while. It retries the component that failed, but after a second or third failure it throws away the entire job.


The second is by design. Nutch sucks at IO and much prefers it when the data is in memory -- most programs are that way.

50 small segments barely need to touch disk (initial creation and final write); in fact, I could run them on diskless machines with gobs of memory. A single segment 50 times the size uses a significant amount of IO for temporary work space.

I have not yet figured out how to get the map/reduce settings for a large segment to have the same IO patterns as small segments.

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372597 ] 

Doug Cutting commented on NUTCH-171:
------------------------------------

> Generate for 20 Segments of 10M in size is almost as fast as 1 segment that is 10M in size. A single 200M URL segment is unwieldy from an error management perspective. I actually prefer 1M URL segments.

How is a 200M url segment unwieldy?  The whole premise of MapReduce is that the system breaks things into restartable chunks for you.  That's why a 200M url segment has lots of separate fetch lists.  The fetch of each may be restarted, as before, to permit error management, but the error management is now automatic: when a fetch task fails it is restarted automatically.  This is equivalent to generating a bunch of separate segments and fetching them separately.  Is MapReduce's automatic error management not good enough?  If not, we should fix that rather than work around it.


[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372600 ] 

Stefan Groschupf commented on NUTCH-171:
----------------------------------------

Doug, I agree that a 200M segment should work and would be the best way to go, but just for your information, we have noticed that larger segments are more likely than smaller segments to crash during reducing. Maybe this has already been solved by one of the many Hadoop patches from the last 2 weeks.
So in any case I see some need (as already discussed) to extend the automatic error management.


[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

Posted by "Rod Taylor (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362508 ] 

Rod Taylor commented on NUTCH-171:
----------------------------------

Overhead of generate/update versus fetch is the big one. A smaller segment size fits easily into memory, reducing the number of disk accesses required. Ten 3M batches can be fetched, parsed, and output by SegmentReader in a far shorter time than a single 30M batch.

This works when you overlap their execution, so that as soon as a tasktracker has space it begins working on the next segment with no delays. The benefit comes in the reduce time required. Ideally we could overlap segment2 map with segment1 reduce to keep bandwidth usage constant.

We are working on a patch to allow specifying a target bandwidth on a per-task basis, with a varying thread count to try to keep the bandwidth usage maxed out at a defined limit (about 50Mb/sec for us). Being able to specify the number of map tasktrackers separately from the number of reduce tasktrackers would make this possible.
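
One way a varying thread count could track a bandwidth target is sketched below. The patch itself is not shown in this thread, so the function, its parameters, and the proportional rule are assumptions for illustration, not the actual implementation.

```python
def next_thread_count(threads, measured_mbps, target_mbps,
                      min_threads=1, max_threads=500):
    """Estimate the thread count needed to reach the target bandwidth.

    Assumes throughput scales roughly linearly with the number of fetch
    threads, which is a simplification.
    """
    if threads < 1:
        return min_threads
    if measured_mbps <= 0:
        # Nothing measured yet: probe upwards by one thread.
        return min(max_threads, threads + 1)
    per_thread = measured_mbps / threads
    wanted = int(target_mbps / per_thread)
    return max(min_threads, min(max_threads, wanted))


# Hypothetical measurements against the 50Mb/sec limit mentioned above.
print(next_thread_count(threads=100, measured_mbps=25.0, target_mbps=50.0))
print(next_thread_count(threads=100, measured_mbps=100.0, target_mbps=50.0))
```

A real limiter would probably smooth the measurements over time and change the thread count gradually rather than jumping straight to the estimate.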


With a linear process like crawl there is a 20% to 50% gap in fetching while generate/update run. We want fetches to overlap with these -- again ideally filling our bandwidth. At the beginning it was lower but as the number of URLs in our database grows the overhead grows with it. I don't really care how long generate takes if there is a steady stream of new data being downloaded at the same time.
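
The size of that gap is simple arithmetic, shown here with made-up durations. The function names and the hour values are hypothetical; only the 20% and 50% figures come from the comment above.

```python
def gap_fraction(fetch_hours, overhead_hours):
    """Fraction of a linear crawl cycle in which no fetching happens."""
    return overhead_hours / (fetch_hours + overhead_hours)


def overlapped_cycle(fetch_hours, overhead_hours):
    """Cycle length if generate/update run fully in parallel with fetch."""
    return max(fetch_hours, overhead_hours)


# Hypothetical durations that reproduce the 20% and 50% ends of the range.
for fetch, overhead in [(8, 2), (4, 4)]:
    linear = fetch + overhead
    print(f"fetch={fetch}h overhead={overhead}h: "
          f"gap={gap_fraction(fetch, overhead):.0%}, "
          f"linear cycle={linear}h, "
          f"overlapped cycle={overlapped_cycle(fetch, overhead)}h")
```

As long as generate/update take no longer than the fetch they overlap with, the overlapped cycle is bounded by the fetch time alone.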


Finally, since we use Nutch as a crawler only (no indexer), there is no performance penalty for having a large number of small segments. Small segments make a number of things easier, such as scheduling maintenance, error recovery, and debugging errors, and of course they make the previously mentioned IO reduction possible.

> Bring back multiple segment support for Generate / Update
> ---------------------------------------------------------
>
>          Key: NUTCH-171
>          URL: http://issues.apache.org/jira/browse/NUTCH-171
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Rod Taylor
>     Priority: Minor
>  Attachments: multi_segment.patch
>
> We find it convenient to be able to run generate once for -topN 300M and have multiple independent segments to work with (lower overhead) -- then run update on all segments which succeeded simultaneously.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided segments again.
> Radu Mateescu wrote the attached patch for us with the below description (lightly edited):
> The implementation of -numFetchers in 0.8 improperly plays with the number of reduce tasks in order to generate a given number of fetch lists. Basically, what it does is this: before the second reduce (map-reduce is applied twice for generate), it sets the number of reduce tasks to numFetchers and ideally, because each reduce will create a file like part-00000, part-00001, etc in the ndfs, we'll end up with the number of desired fetched lists. But this behaviour is incorrect for the following reasons:
> 1. the number of reduce tasks is orthogonal to the number of segments somebody wants to create. The number of reduce tasks should be chosen based on the physical topology rather then the number of segments someone might want in ndfs
> 2. if in nutch-site.xml you specify a value for mapred.reduce.tasks property, the numFetchers seems to be ignored
>  
> Therefore , I changed this behaviour to work like this: 
>  - generate will create numFetchers segments
>  - each reduce task will write in all segments (assuming there are enough values to be written) in a round-robin fashion
> The end results for 3 reduce tasks and 2 segments will look like this :
>  
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122228 Client connection to 192.168.0.1:5466: starting
> 060111 122228 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/20060111122144-0    <dir>
> /user/root/segments/20060111122144-1    <dir>
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/20060111122144-0/crawl_generate/part-00000  1276
> /user/root/segments/20060111122144-0/crawl_generate/part-00001  1289
> /user/root/segments/20060111122144-0/crawl_generate/part-00002  1858
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/20060111122144-1/crawl_generate/part-00000  1207
> /user/root/segments/20060111122144-1/crawl_generate/part-00001  1236
> /user/root/segments/20060111122144-1/crawl_generate/part-00002  1841
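
The round-robin layout in the quoted description (every reduce task writes into every segment, so each segment ends up with one part file per reduce task) can be sketched in a few lines of Python. This is an illustration only, not the code in multi_segment.patch: the function name write_round_robin, the sample URLs, and the in-memory dictionaries standing in for NDFS files are all assumptions made for the example.

```python
def write_round_robin(reduce_outputs, num_segments):
    """Spread each reduce task's entries over all segments in turn.

    reduce_outputs: one list of entries per reduce task (hypothetical input).
    Returns {segment_index: {part_file_name: [entries]}}.
    """
    segments = {}
    for task_id, entries in enumerate(reduce_outputs):
        # Each reduce task keeps its own part-NNNNN name in every segment.
        part = f"part-{task_id:05d}"
        for i, entry in enumerate(entries):
            # Entry i of this task goes to segment (i mod num_segments).
            seg = segments.setdefault(i % num_segments, {})
            seg.setdefault(part, []).append(entry)
    return dict(sorted(segments.items()))


# Three reduce tasks and two segments, as in the listing above.
reduce_outputs = [
    ["a1", "a2", "a3"],
    ["b1", "b2"],
    ["c1", "c2", "c3", "c4"],
]
layout = write_round_robin(reduce_outputs, num_segments=2)
for seg, parts in layout.items():
    for part, entries in sorted(parts.items()):
        print(f"segment-{seg}/crawl_generate/{part}: {entries}")
```

With enough entries per task, every segment receives a part file from every reduce task, which matches the "Found 3 items" listings for both segments. The number of segments stays independent of the number of reduce tasks.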

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372556 ] 

Doug Cutting commented on NUTCH-171:
------------------------------------

> Ideally we could overlap segment2 map with segment1 reduce to keep bandwidth usage constant.

Overlapping map2 with reduce1 should work today.  But I think we need more than that.

One thing that's needed is the ability to mark urls as "being fetched", which was in 0.7 but has not yet made it into 0.8.  In addition, we need to be able to prioritize jobs.

Ideally crawling should work something like:

1. generate segment 1
2. start fetching segment 1
3. generate segment 2
4. wait for segment 1 fetch to complete
5. start fetching segment 2
6. update db with output from fetch 1
7. generate segment 3
8. wait for segment 2 fetch to complete
...

For this to work one job must be able to "pass" another.  While a fetch is running, an update/generate cycle must be able to complete.  This would be a better solution, no?  A crude way to do this would be to run the update/generate on a separate jobtracker/tasktracker configuration, running on the same machines.  But ideally we could configure a fetching job so that, e.g., a tasktracker would only run one of its tasks at a time.  Then, when an update or generate job is submitted, there would be available task slots on the tasktrackers for these jobs.  Does that sound like a reasonable approach?
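
The ordering in the numbered list above can be written out as a small Python generator, which makes the overlap explicit: segment k is generated while segment k-1 is still being fetched, and the db update for fetch k-1 runs while segment k is being fetched. This is a sketch of the schedule only, not Nutch code. The name crawl_schedule is hypothetical, and the two closing steps (waiting for and updating from the last segment) are an assumption about how a finite run would end, since the list above trails off.

```python
def crawl_schedule(num_segments):
    """Yield the steps of the overlapped generate/fetch/update loop."""
    if num_segments < 1:
        return
    yield "generate segment 1"
    yield "start fetching segment 1"
    for k in range(2, num_segments + 1):
        # Segment k is generated while segment k-1 is still being fetched.
        yield f"generate segment {k}"
        yield f"wait for segment {k - 1} fetch to complete"
        yield f"start fetching segment {k}"
        # The update for fetch k-1 overlaps with the fetch of segment k.
        yield f"update db with output from fetch {k - 1}"
    # Assumed ending for a finite run: drain the last fetch.
    yield f"wait for segment {num_segments} fetch to complete"
    yield f"update db with output from fetch {num_segments}"


steps = list(crawl_schedule(3))
for number, step in enumerate(steps, start=1):
    print(f"{number}. {step}")
```

Because each generate runs before the previous fetch has been folded back into the db, the generator needs a way to skip urls that are already in flight. That is the "being fetched" marker mentioned above.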



[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362507 ] 

Doug Cutting commented on NUTCH-171:
------------------------------------

I'd like to hear more about why you want multiple segments and what's motivating this patch.  The 0.7 -numFetchers parameter was designed to permit distributed fetching.  With MapReduce the fetcher runs as a distributed map task, so the number of fetchers is now set to the number of map tasks.  The crawl db is updated with the output of all fetcher tasks in a single step, as you desire.

There may be something in the current implementation that is causing you problems; I'm just not yet sure what it is or why this is the solution.


