Posted to user@nutch.apache.org by Venkatesh Babu <vb...@yahoo.com> on 2009/01/24 18:37:07 UTC

Issue with merging segments with s/w built from main trunk

Hello,
      I have a version built from trunk a couple of weeks back (based on
r733228). I did a fetch on a website and now have half a dozen segments out
of the fetch, which I want to merge. When I merge two small segments (up to
24,000 pages), it works fine and takes around 10 minutes. The next round has
close to a hundred thousand pages, and the final round around two hundred
thousand pages. When I tried to merge these segments, the job ran for close
to 20 hours and filled my entire hard disk (at least 440 GB of new data was
generated). All of the generated data was in the hadoop tmp folder. For
reference, my unmerged segments folder is only around 3.5 GB.
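(For reference, the merge is invoked along the lines of "bin/nutch mergesegs
crawl/MERGEDsegs -dir crawl/segments"; I am quoting the flags approximately,
so treat the exact command as a sketch.)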
Questions:
1. Does the map-reduce operation generate intermediate data at such a high
ratio relative to the input?
2. Has anybody recently done a large merge of segments with a main trunk
build?
I have noted the troubleshooting done so far, the current state of my logs,
and the disk usage of the tmp folder below.

Thanks in advance,
VB


Troubleshooting done so far:
1. I tried resetting the hadoop tmp folder to the default (instead of my
custom path), and that does not seem to have helped.
2. I tried to run the same merge with the stable version, but since my
fetch was done with this new version, there is a record version mismatch
(expecting V5, found V6). So unless I want to give up my fetched segments,
I am stuck with the main trunk version.

Data:     
My logs are stuck at the stage below, where the job goes into file-creation
mode.
2009-01-24 13:08:47,455 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-01-24 13:11:32,833 INFO  segment.SegmentMerger - Merging 6 segments to crawl/MERGEDsegs/20090124131132
2009-01-24 13:11:32,841 INFO  segment.SegmentMerger - SegmentMerger:  adding crawl/segments/20090117095538
2009-01-24 13:11:32,851 INFO  segment.SegmentMerger - SegmentMerger:  adding crawl/segments/20090117095548
2009-01-24 13:11:32,861 INFO  segment.SegmentMerger - SegmentMerger:  adding crawl/segments/20090117095645
2009-01-24 13:11:32,867 INFO  segment.SegmentMerger - SegmentMerger:  adding crawl/segments/20090117101316
2009-01-24 13:11:32,872 INFO  segment.SegmentMerger - SegmentMerger:  adding crawl/segments/20090117132239
2009-01-24 13:11:32,877 INFO  segment.SegmentMerger - SegmentMerger:  adding crawl/segments/20090118030001
2009-01-24 13:11:32,884 INFO  segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
2009-01-24 13:11:32,918 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

The following is an excerpt from du output for my hadoop tmp folder. As you
can see, it has already crossed 100 GB, and it will eventually fill the
hard disk.
139343952       ./mapred/local/taskTracker/jobcache/job_local_0001
139343972       ./mapred/local/taskTracker/jobcache
139343976       ./mapred/local/taskTracker
4       ./mapred/local/attempt_local_0001_m_000072_0
4       ./mapred/local/attempt_local_0001_m_000077_0
4       ./mapred/local/attempt_local_0001_m_000082_0
4       ./mapred/local/attempt_local_0001_m_000049_0
4       ./mapred/local/attempt_local_0001_m_000078_0
4       ./mapred/local/attempt_local_0001_m_000062_0
4       ./mapred/local/attempt_local_0001_m_000046_0
4       ./mapred/local/index
4       ./mapred/local/attempt_local_0001_m_000031_0
4       ./mapred/local/attempt_local_0001_m_000039_0
4       ./mapred/local/attempt_local_0001_m_000032_0
4       ./mapred/local/attempt_local_0001_m_000061_0
4       ./mapred/local/attempt_local_0001_m_000080_0
4       ./mapred/local/attempt_local_0001_m_000034_0
4       ./mapred/local/attempt_local_0001_m_000055_0
4       ./mapred/local/attempt_local_0001_m_000047_0
4       ./mapred/local/attempt_local_0001_m_000036_0
4       ./mapred/local/attempt_local_0001_m_000054_0
4       ./mapred/local/attempt_local_0001_m_000042_0
4       ./mapred/local/attempt_local_0001_r_000000_0
4       ./mapred/local/attempt_local_0001_m_000071_0
4       ./mapred/local/attempt_local_0001_m_000056_0
4       ./mapred/local/attempt_local_0001_m_000083_0
4       ./mapred/local/attempt_local_0001_m_000081_0
4       ./mapred/local/attempt_local_0001_m_000038_0
4       ./mapred/local/attempt_local_0001_m_000052_0
4       ./mapred/local/attempt_local_0001_m_000079_0
4       ./mapred/local/attempt_local_0001_m_000057_0
139344212       ./mapred/local
4       ./mapred/temp
23556   ./mapred/system/job_local_0001
23560   ./mapred/system
139367780       ./mapred
139367784       .



Re: Issue with merging segments with s/w built from main trunk

Posted by Doğacan Güney <do...@gmail.com>.
On Sun, Jan 25, 2009 at 1:17 PM, Venkatesh Babu <vb...@yahoo.com> wrote:
>
> Hello Doğacan,
>     Thanks for the reply. I had a query on the following:
> ">> 1. Does the map-reduce operation generate intermediate data at such a
> >> high ratio relative to the input?
> >
> > Yes."
> 1. In your past experience, what is the general ratio of the size of the
> segments being merged to the maximum disk space required during the merge
> operation?
> 2. If you were using nutch before the hadoop implementation, was it any
> better when run without hadoop?
>

I have never used a hadoop-free nutch, and I have not worked with large
segments for a looong time, so I don't know :)

>>"Compressing temporary outputs may help you here. "
> 3. I guess compression would have a cost. Since it is already taking me more
> than a day to merge these segments which are only 3GB and I have a task to
> merge segments of 40GB or more, I was wondering how long this would take if
> I enable compression. Guess my question is would you have any data on how
> slow the merge would become if I enable compression of map output.
>

I have done some analysis in https://issues.apache.org/jira/browse/NUTCH-392

The short answer is: don't worry too much :) Especially if you use lzo,
performance will be very good.

Another thing: if you do not need the "content" directory in your merged
segment, just rename the "content" directories inside your segments to
something else. That way mergesegs will not merge the "content" data, which
should reduce the disk space requirement a lot.
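For example (using one of your segment names), something like
"mv crawl/segments/20090117095538/content crawl/segments/20090117095538/content.keep"
for each segment would do; the ".keep" suffix is just an illustration, and
you can rename them back later if you still need the raw content.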

> Thanks,
> VB
>



-- 
Doğacan Güney

Re: Issue with merging segments with s/w built from main trunk

Posted by Venkatesh Babu <vb...@yahoo.com>.
Hello Doğacan,
     Thanks for the reply. I had a query on the following:
">> 1. Does the map-reduce operation generate intermediate data at such a
>> high ratio relative to the input?

> Yes."

1. In your past experience, what is the general ratio of the size of the
segments being merged to the maximum disk space required during the merge
operation?
2. If you were using nutch before the hadoop implementation, was it any
better when run without hadoop?

>"Compressing temporary outputs may help you here. "
3. I guess compression would have a cost. Since it is already taking me more
than a day to merge these segments which are only 3GB and I have a task to
merge segments of 40GB or more, I was wondering how long this would take if
I enable compression. Guess my question is would you have any data on how
slow the merge would become if I enable compression of map output.

Thanks,
VB



Re: Issue with merging segments with s/w built from main trunk

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On Sat, Jan 24, 2009 at 7:37 PM, Venkatesh Babu <vb...@yahoo.com> wrote:
>
> Hello,
>      I have a version built from trunk a couple of weeks back (based on
> r733228). I did a fetch on a website and now have half a dozen segments
> out of the fetch, which I want to merge. When I merge two small segments
> (up to 24,000 pages), it works fine and takes around 10 minutes. The next
> round has close to a hundred thousand pages, and the final round around
> two hundred thousand pages. When I tried to merge these segments, the job
> ran for close to 20 hours and filled my entire hard disk (at least 440 GB
> of new data was generated). All of the generated data was in the hadoop
> tmp folder. For reference, my unmerged segments folder is only around
> 3.5 GB.
> Questions:
> 1. Does the map-reduce operation generate intermediate data at such a
> high ratio relative to the input?

Yes.

> 2. Has anybody recently done a large merge of segments with a main trunk
> build?
> I have noted the troubleshooting done so far, the current state of my
> logs, and the disk usage of the tmp folder below.
>
> Thanks in advance,
> VB
>
>
> Troubleshooting done so far:
> 1. I tried resetting the hadoop tmp folder to the default (instead of my
> custom path), and that does not seem to have helped.
> 2. I tried to run the same merge with the stable version, but since my
> fetch was done with this new version, there is a record version mismatch
> (expecting V5, found V6). So unless I want to give up my fetched segments,
> I am stuck with the main trunk version.
>

Trunk uses a more recent version of hadoop, so it is more likely that trunk
behaves better than 0.9 in this regard.

You may try playing with some hadoop options to see if they help.

<property>
  <name>mapred.output.compress</name>
  <value>false</value>
  <description>Should the job outputs be compressed?
  </description>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>RECORD</value>
  <description>If the job outputs are to be compressed as SequenceFiles,
               how should they be compressed? Should be one of NONE,
               RECORD or BLOCK.
  </description>
</property>

<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  <description>If the job outputs are compressed, how should they be compressed?
  </description>
</property>

<property>
  <name>mapred.compress.map.output</name>
  <value>false</value>
  <description>Should the outputs of the maps be compressed before being
               sent across the network? Uses SequenceFile compression.
  </description>
</property>


Compressing temporary outputs may help you here.
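For example, a minimal override in conf/hadoop-site.xml could look like the
sketch below. The property names are the ones quoted above, but the values
(BLOCK, DefaultCodec, and the LzoCodec alternative) are just my suggestion,
so adjust them for your setup:

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Compress intermediate map output (the data that is
               filling your hadoop tmp folder).</description>
</property>

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
  <description>Also compress the final job outputs (the merged
               segment).</description>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>BLOCK generally compresses SequenceFiles better than
               RECORD.</description>
</property>

<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  <description>Swap in org.apache.hadoop.io.compress.LzoCodec if your
               build ships the native lzo libraries.</description>
</property>

The first property is the one most relevant to your problem, since it is
the intermediate data that is eating your disk.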

Note that for these to work properly, your code should dynamically link to
the libraries in lib/native.

> [...]



-- 
Doğacan Güney