Posted to user@nutch.apache.org by Emmanuel <jo...@gmail.com> on 2007/10/07 17:01:37 UTC

Compression issue?

I've made a huge crawl over 2M URLs and noticed that my folder was using
more than 40 GB.
I found it very weird, so I decided to make a simple test: I ran the same
crawl twice with different parameters:
1- First crawl: I added the following property to hadoop-site.xml:
<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
  <description></description>
</property>
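
(For reference, the same setting can also be applied programmatically on a
Hadoop Configuration instead of editing hadoop-site.xml; the recognised
values are NONE, RECORD and BLOCK. A minimal sketch, assuming Hadoop is on
the classpath:

import org.apache.hadoop.conf.Configuration;

public class SetCompressionType {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Same effect as the <property> block above; NONE, RECORD or BLOCK.
    conf.set("io.seqfile.compression.type", "BLOCK");
    System.out.println(conf.get("io.seqfile.compression.type"));
  }
}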

I launched a crawl over 50,000 URLs/pages.

I checked the space used on disk and found:
$ du -ks  /data/nutch/local/ctsite51/segments/20071007151116/*
301400  /data/nutch/local/ctsite51/segments/20071007151116/content
376     /data/nutch/local/ctsite51/segments/20071007151116/crawl_fetch
1580    /data/nutch/local/ctsite51/segments/20071007151116/crawl_generate
1488    /data/nutch/local/ctsite51/segments/20071007151116/crawl_parse
2424    /data/nutch/local/ctsite51/segments/20071007151116/parse_data
17060   /data/nutch/local/ctsite51/segments/20071007151116/parse_text

2- Second crawl: I removed the property from hadoop-site.xml and launched
the same crawl over 50,000 URLs/pages.

I checked the space used on disk and found:
$ du -ks  /data/nutch/local/ctsite52/segments/20071007184837/*
318568  /data/nutch/local/ctsite52/segments/20071007184837/content
1536    /data/nutch/local/ctsite52/segments/20071007184837/crawl_fetch
1580    /data/nutch/local/ctsite52/segments/20071007184837/crawl_generate
41668   /data/nutch/local/ctsite52/segments/20071007184837/crawl_parse
9596    /data/nutch/local/ctsite52/segments/20071007184837/parse_data
17152   /data/nutch/local/ctsite52/segments/20071007184837/parse_text

It looks like the data in the content and parse_text folders is not
compressed.
Is this normal? Do you have the same issue?

E
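
A quick way to check what actually ended up on disk is to open one of the
segment data files with Hadoop's SequenceFile.Reader and ask whether it is
block-compressed. A minimal sketch; the content/part-00000/data layout is
an assumption about how the segment is stored locally, so adjust the path
to your setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class CheckSegmentCompression {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    // e.g. .../segments/20071007151116/content/part-00000/data
    Path data = new Path(args[0]);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    try {
      System.out.println("compressed:       " + reader.isCompressed());
      System.out.println("block compressed: " + reader.isBlockCompressed());
      System.out.println("codec:            " + reader.getCompressionCodec());
    } finally {
      reader.close();
    }
  }
}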

Re: Compression issue?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Emmanuel wrote:
> [...]
> It looks like the data in the content and parse_text folders is not
> compressed.
> Is this normal? Do you have the same issue?

What you see is the difference between RECORD and BLOCK compression. Nutch
uses RECORD compression for content/ and parse_text/, which is less
efficient in terms of maximum compression ratio, but preserves record
boundaries - which is crucial for good random-access performance.
BLOCK-compressed data causes performance issues for random access, because
the data has to be read from the nearest sync mark (or from the beginning
of the file) and decompressed.
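
To illustrate the distinction at the SequenceFile API level, here is a
minimal, self-contained sketch that writes one RECORD-compressed and one
BLOCK-compressed file (illustrative only, not Nutch's actual writer code;
the /tmp paths and sample key/value are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RecordVsBlock {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);

    // RECORD: each value is compressed on its own, so a single record can
    // be decompressed independently (good for random access, weaker ratio).
    SequenceFile.Writer record = SequenceFile.createWriter(fs, conf,
        new Path("/tmp/record.seq"), Text.class, Text.class,
        SequenceFile.CompressionType.RECORD);

    // BLOCK: many records are buffered and compressed together between
    // sync marks (better ratio, but a random read has to go back to the
    // nearest sync mark and decompress the whole block).
    SequenceFile.Writer block = SequenceFile.createWriter(fs, conf,
        new Path("/tmp/block.seq"), Text.class, Text.class,
        SequenceFile.CompressionType.BLOCK);

    record.append(new Text("http://example.com/"), new Text("page content"));
    block.append(new Text("http://example.com/"), new Text("page content"));

    record.close();
    block.close();
  }
}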


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com