Posted to dev@nutch.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/10/25 19:44:17 UTC

[jira] Created: (NUTCH-392) OutputFormat implementations should pass on Progressable

OutputFormat implementations should pass on Progressable
--------------------------------------------------------

                 Key: NUTCH-392
                 URL: http://issues.apache.org/jira/browse/NUTCH-392
             Project: Nutch
          Issue Type: New Feature
          Components: fetcher
            Reporter: Doug Cutting


OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations.  This will keep reduce tasks from timing out when block writes are slow.  This issue depends on http://issues.apache.org/jira/browse/HADOOP-636.
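A minimal sketch of the pattern this issue asks for, using simplified stand-in types rather than the real Hadoop classes (the actual Progressable and SequenceFile writer live in org.apache.hadoop; everything here is illustrative only):

```java
// Illustration only: stand-ins for the Hadoop types, not the real API.
import java.util.concurrent.atomic.AtomicInteger;

public class ProgressSketch {

    // Stand-in for org.apache.hadoop.util.Progressable.
    interface Progressable {
        void progress();
    }

    // Stand-in for a SequenceFile-style writer that reports progress
    // on every append, so slow block writes don't look like a hang.
    static class Writer {
        private final Progressable progress;

        Writer(Progressable progress) {
            this.progress = progress;
        }

        void append(String key, String value) {
            // ... write the record; while blocked on a slow write,
            // keep the reduce task alive:
            if (progress != null) {
                progress.progress();
            }
        }
    }

    // The fix: the OutputFormat forwards the reduce task's Progressable
    // to the writer it creates instead of dropping it.
    static Writer getRecordWriter(Progressable reduceTask) {
        return new Writer(reduceTask);
    }

    public static void main(String[] args) {
        AtomicInteger pings = new AtomicInteger();
        Writer w = getRecordWriter(pings::incrementAndGet);
        w.append("http://example.com/", "some parse text");
        System.out.println(pings.get()); // 1 progress ping per append
    }
}
```

Without the forwarding step, the reduce task's progress reporter never fires during long writes and the framework may kill the task as timed out.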

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508818 ] 

Doğacan Güney commented on NUTCH-392:
-------------------------------------

> Re: Content versioning - we can use negative int values as version numbers. I'm still not sure what is the impact of 
> BLOCK compression on MapFile random access. 

Good idea! 

(Btw, I still believe that BLOCK compression's performance hit is irrelevant for anything but parse_text. That's why I am trying to do the second test: measuring how fast random access on parse_text is under different compressions. BLOCK compression will probably not be fast enough for parse_text, but if the impact is minor, it can be used for everything else.)

>  Regarding the sizes: parse_text_record size is larger, because for small chunks of data the compression overhead may far
> outweigh the compression gains. Re: the large size of crawl_parse - is this related to your patch? It could be simply related to 
> the fact that there are many outlinks in those pages ... Or is crawl_parse using BLOCK compression too?

OK, I understand why parse_text_record is larger, thanks for the explanation. But why is parse_text_block's size so close to parse_text? (Why is content so different from parse_text? BLOCK works wonders on content but does not even give a 10% reduction on parse_text.) The feed plugin wasn't enabled, so my patch shouldn't matter. Also, crawl_parse is using NONE compression.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500951 ] 

Andrzej Bialecki  commented on NUTCH-392:
-----------------------------------------

I don't think it's a good idea; it creates too many cryptic options ... Average users won't be able to assess what the best choices are, and advanced users can change this directly in the source anyway ...



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508823 ] 

Doğacan Güney commented on NUTCH-392:
-------------------------------------

> data of parse_text is already compressed so recompressing it does not give huge gains

Wow, I am certainly not at my sharpest today. Thanks for pointing it out. I will change ParseText and report back with the results.



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500935 ] 

Doğacan Güney commented on NUTCH-392:
-------------------------------------

Perhaps we can allow a user to configure this on a per-structure basis by adding new properties: compression.type.{parse_text,crawldb,parse_data,linkdb} or whatever. Then we can make each such property take one of four valid values: BLOCK, NONE, RECORD, or DEFAULT, where DEFAULT means the value of io.sequence.file.compression.
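A hypothetical sketch of how such per-structure properties could resolve; the property names and the resolve() helper are this proposal's illustration, not existing Nutch configuration:

```java
// Hypothetical: the property keys and helper below are illustrative only.
import java.util.HashMap;
import java.util.Map;

public class CompressionConfigSketch {

    enum CompressionType { NONE, RECORD, BLOCK }

    // Look up compression.type.<structure>; DEFAULT (or an unset property)
    // falls back to the global io.sequence.file.compression value.
    static CompressionType resolve(Map<String, String> conf, String structure) {
        String value = conf.getOrDefault("compression.type." + structure, "DEFAULT");
        if (value.equals("DEFAULT")) {
            value = conf.getOrDefault("io.sequence.file.compression", "RECORD");
        }
        return CompressionType.valueOf(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("compression.type.parse_text", "RECORD");
        conf.put("io.sequence.file.compression", "BLOCK");
        System.out.println(resolve(conf, "parse_text")); // RECORD, set explicitly
        System.out.println(resolve(conf, "crawldb"));    // BLOCK, via the global default
    }
}
```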



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822 ] 

Doug Cutting commented on NUTCH-392:
------------------------------------

Anchors, explain, and the cache are used relatively infrequently, considerably less than once per query, and hence *much* less than once per displayed hit. So it might be acceptable if they're somewhat slower. Block compression should still be fast enough for interactive use, and these uses would never dominate CPU use in an application, would they?



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-392?page=comments#action_12444719 ] 
            
Doug Cutting commented on NUTCH-392:
------------------------------------

This should not be applied until Nutch uses Hadoop 0.8.  It also contains a patch required to make Nutch work correctly with Hadoop 0.8 (where LocalFileSystem.rename() of a non-existing file now throws an exception).


[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-392:
--------------------------------

    Attachment: ParseTextBenchmark.java

benchmark code.



[jira] Resolved: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-392.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Andrzej Bialecki   (was: Doug Cutting)

Patch applied with small changes in rev. 543264. Thank you!



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508900 ] 

Andrzej Bialecki  commented on NUTCH-392:
-----------------------------------------

Excellent work, Doğacan - thank you. The numbers for RECORD compression probably depend on some sweet spot in the environment, related to CPU usage, how the OS pulls data from the disk / disk buffers, the size of the hard drive cache, the size of internal memory buffers in Hadoop, etc. I would venture a guess that NONE compression is raw disk I/O bound, whereas BLOCK compression suffers from the poor performance of seeking in compressed data.

I agree with your conclusions regarding the type of compression to use for each segment part.

Re: Nutch not doing any internal compression for Content and ParseText: Content is a versioned writable, so we can change its implementation and provide compatibility code to read older data. The same with ParseText.



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508820 ] 

Sami Siren commented on NUTCH-392:
----------------------------------

> But why is parse_text_block's size so close to parse_text 
data of parse_text is already compressed so recompressing it does not give huge gains



[jira] Assigned: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]

Doug Cutting reassigned NUTCH-392:
----------------------------------

    Assignee: Doug Cutting


[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500635 ] 

Andrzej Bialecki  commented on NUTCH-392:
-----------------------------------------

Good point. We can change it to use the following pattern (as Hadoop uses internally), e.g.:

contentOut = new MapFile.Writer(job, fs, content.toString(), Text.class, Content.class, SequenceFile.getCompressionType(job), progress);

However, the original patch had some merits, too. Some types of data are not very compressible on their own (using RECORD compression), i.e. it takes more effort to compress/decompress them than the space savings are worth. In the case of crawl_parse and crawl_fetch it would make sense to enforce the BLOCK or NONE compression type, and disallow the RECORD type.

I know that BLOCK compression gives better space savings, and may incidentally increase writing speed. But I'm not sure what the performance impact of BLOCK-compressed MapFile-s is when doing random reading - this is the scenario in LinkDbInlinks, FetchedSegments and similar places. Could you perhaps test it? The original patch used RECORD compression for MapFile-s, probably for this reason.
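The "enforce BLOCK or NONE, disallow RECORD" idea above could look something like this; the enforce() helper and its policy table are hypothetical illustrations, not Nutch code:

```java
// Hypothetical guard: crawl_parse and crawl_fetch accept only BLOCK or
// NONE, and a disallowed RECORD request is mapped to BLOCK.
import java.util.Set;

public class CompressionGuardSketch {

    enum CompressionType { NONE, RECORD, BLOCK }

    // Segment parts where RECORD compression would be disallowed.
    static final Set<String> NO_RECORD = Set.of("crawl_parse", "crawl_fetch");

    static CompressionType enforce(String part, CompressionType requested) {
        if (NO_RECORD.contains(part) && requested == CompressionType.RECORD) {
            return CompressionType.BLOCK;
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(enforce("crawl_parse", CompressionType.RECORD)); // BLOCK
        System.out.println(enforce("parse_text", CompressionType.RECORD));  // RECORD
    }
}
```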



[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]

Doug Cutting updated NUTCH-392:
-------------------------------

    Attachment: NUTCH-392.patch


[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508861 ] 

Doğacan Güney commented on NUTCH-392:
-------------------------------------

After changing ParseText to not do any internal compression, segment directory looks like this:

828M    crawl/segments/20070626163143/content
35M     crawl/segments/20070626163143/crawl_fetch
23M     crawl/segments/20070626163143/crawl_generate
44M     crawl/segments/20070626163143/crawl_parse # BLOCK compression
218M    crawl/segments/20070626163143/parse_data
524M    crawl/segments/20070626163143/parse_text
192M    crawl/segments/20070626163143/parse_text_block
242M    crawl/segments/20070626163143/parse_text_record

As you can see, parse_text_block is around 20% smaller than parse_text_record.

I also wrote a simple benchmark that randomly requests n urls from each parse text sequentially (it requests the same urls in the same order from all parse texts). All parse texts contain a single part with ~250K urls. Here are the results (trial 0 is NONE, trial 1 is RECORD, trial 2 is BLOCK):

for n = 1000:
Trial 0 has taken 9947 ms.
Trial 1 has taken 6794 ms.
Trial 2 has taken 9717 ms.

for n = 5000:
Trial 0 has taken 40918 ms.
Trial 1 has taken 19969 ms.
Trial 2 has taken 52622 ms.

for n = 10000
Trial 0 has taken 57622 ms.
Trial 1 has taken 24291 ms.
Trial 2 has taken 96292 ms.

Overall RECORD compression is the fastest and BLOCK compression is the slowest (by a large margin).
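The timings above convert to per-lookup figures as follows (a quick check of the "~100 parse texts per second" estimate in the conclusions that follow):

```java
// Per-lookup latency and throughput derived from the n = 10000 timings
// reported above (NONE, RECORD, BLOCK).
public class ThroughputSketch {
    public static void main(String[] args) {
        long[] ms = {57622, 24291, 96292};
        String[] label = {"NONE", "RECORD", "BLOCK"};
        int n = 10000;
        for (int i = 0; i < ms.length; i++) {
            double perLookupMs = ms[i] / (double) n;   // e.g. BLOCK: ~9.6 ms
            double perSecond = 1000.0 / perLookupMs;   // e.g. BLOCK: ~104/s
            System.out.printf("%s: %.2f ms/lookup, ~%.0f lookups/s%n",
                    label[i], perLookupMs, perSecond);
        }
    }
}
```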

Assuming my benchmark code is correct (feel free to show me where it is wrong), these are my conclusions:

* I don't know what others think, but to me it still looks like we can use BLOCK compression for structures like content, linkdb, etc. Even though it is much slower than RECORD, it can still serve ~100 parse texts per second. While this is certainly not good enough for parse text, it probably is good enough for the others.

* We should definitely enable RECORD compression for parse text and BLOCK compression for crawl_*. For some reason, RECORD compression scales sublinearly with n for parse text (which makes me think that something is wrong with my benchmark code).

* Nutch should not do any compression internally; Hadoop can do this better with its native compression. Content and ParseText currently compress their data on their own (and they can be converted to Hadoop's compression in a backward-compatible way). I don't know if anything else does compression.

PS: The native Hadoop library is loaded. I haven't specified which compression codec to use, so I guess it uses zlib. LZO results would probably have been better.



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500665 ] 

Doğacan Güney commented on NUTCH-392:
-------------------------------------

I think it is okay to allow BLOCK compression for linkdb, crawldb, crawl_*, content, and parse_data, because I don't think that people will need fast random access on anything but parse_text.

I agree that we need to test the performance impact of BLOCK compression before committing such a change. Unfortunately, our setup doesn't include BLOCK compression right now. I will try to test it and report some results once I get the chance.

PS: Compressing content will not bring significant savings right now, since it is already compressed internally, but once Content stops doing that I think there will be _huge_ savings there.



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508812 ] 

Doğacan Güney commented on NUTCH-392:
-------------------------------------

OK, I have done a bit of testing on compression but I'm stuck. Here it is:

* I changed Content to be a regular Writable instead of a CompressedWritable and turned on BLOCK compression. Results were pretty impressive: Content size went down from ~1GB to ~500MB. Unfortunately, I haven't figured out how we can change Content in a backward-compatible way. Reading the first byte as a version won't work, because the first thing written is not a version but the size of the compressed data as an int.

* This is where it gets strange. I was trying to test the performance impact of BLOCK compression (when generating summaries).  I fetched a sample 250000 url segment (a subset of dmoz). Then I made a small modification to ParseOutputFormat so that it outputs parse_text in all three compression formats ( http://www.ceng.metu.edu.tr/~e1345172/comp_parse.patch ). After parsing, segment looks like this:

828M    crawl/segments/20070626163143/content
35M     crawl/segments/20070626163143/crawl_fetch
23M     crawl/segments/20070626163143/crawl_generate
345M    crawl/segments/20070626163143/crawl_parse
196M    crawl/segments/20070626163143/parse_data
244M    crawl/segments/20070626163143/parse_text # NONE
232M    crawl/segments/20070626163143/parse_text_block # BLOCK
246M    crawl/segments/20070626163143/parse_text_record # RECORD

Not only is parse_text_record larger than parse_text while parse_text_block is only slightly smaller, but crawl_parse is larger than any of them!

I probably messed up somewhere and I can't see it. Any help would be welcome.



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508816 ] 

Andrzej Bialecki  commented on NUTCH-392:
-----------------------------------------

Re: Content versioning - we can use negative int values as version numbers. I'm still not sure what is the impact of BLOCK compression on MapFile random access.

Regarding the sizes: parse_text_record is larger because, for small chunks of data, the compression overhead may far outweigh the compression gains. Re: the large size of crawl_parse - is this related to your patch? It could simply be related to the fact that there are many outlinks in those pages ... Or is crawl_parse using BLOCK compression too?
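A minimal sketch of the negative-version trick, assuming (as noted in the earlier comment) that legacy Content records begin with a non-negative int holding the compressed-data size; the record layout here is a simplified stand-in, not the actual Content/ParseText serialization:

```java
// Sketch only: a negative leading int marks "new format, version -N";
// a non-negative one is legacy data where that int was the compressed size.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VersionSketch {

    static byte[] writeNew(String text, int version) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(-version); // negative => new, uncompressed format
            out.writeUTF(text);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int first = in.readInt();
            if (first < 0) {
                // new format; -first is the version number
                return in.readUTF();
            }
            // legacy path: 'first' bytes of compressed payload would follow
            return "<legacy record, " + first + " compressed bytes>";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(writeNew("hello", 2))); // hello
    }
}
```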



[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500728 ] 

Andrzej Bialecki  commented on NUTCH-392:
-----------------------------------------

> I think it is okay to allow BLOCK compression for linkdb, crawldb, crawl_*,
> content, parse_data. Because I don't think that people will need fast random-access
>  on anything but parse_text.

LinkDb is accessed randomly on-line through LinkDbInlinks when users request anchors. Similarly, parse_data is accessed when requesting "explain", and may also be accessed to retrieve other hit metadata. Content is accessed randomly when displaying the cached preview. I think in all these cases we can use at most RECORD compression, or NONE.
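The access-cost asymmetry behind this recommendation can be stated as a toy model (not Hadoop's actual MapFile internals): serving one random get decompresses nothing under NONE, exactly one value under RECORD, but the whole enclosing block under BLOCK - which is why data served randomly on-line wants RECORD or NONE.

```java
public class RandomAccessCost {

    enum CompressionType { NONE, RECORD, BLOCK }

    /** Values that must be decompressed to serve one random get(). */
    static int valuesDecompressedPerGet(CompressionType type,
                                        int recordsPerBlock) {
        switch (type) {
            case NONE:   return 0;               // raw bytes, read directly
            case RECORD: return 1;               // only the requested value
            case BLOCK:  return recordsPerBlock; // the whole enclosing block
            default:     throw new AssertionError(type);
        }
    }

    public static void main(String[] args) {
        int perBlock = 1000; // hypothetical records per compressed block
        for (CompressionType t : CompressionType.values()) {
            System.out.println(t + ": "
                + valuesDecompressedPerGet(t, perBlock) + " value(s) per get");
        }
    }
}
```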




[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]

Doug Cutting updated NUTCH-392:
-------------------------------

    Attachment:     (was: NUTCH-392.patch)


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500603 ] 

Doğacan Güney commented on NUTCH-392:
-------------------------------------

From what I understand of the MapFile.Writer code in Hadoop, if you pass a CompressionType as an argument to its constructor, it overrides the compression value in the config. So since Nutch manually sets parse_text and parse_data to RECORD compression (and crawl_parse to NONE), we will not get the advantages of BLOCK compression even if we set it in the config.

BLOCK compression seems to work really well if you have the native libraries in place, so IMHO it would be better not to set CompressionType manually, and to let people set it to whatever they want in the config.
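The config-driven selection being suggested could look roughly like this. Everything here is a sketch: a plain Map stands in for Hadoop's Configuration, and the per-output property names are hypothetical (only "io.seqfile.compression.type" is a real Hadoop property). The idea is to resolve the compression type from the config, with a per-output override falling back to a global default, instead of hardcoding a CompressionType in the MapFile.Writer constructor.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch of config-driven compression selection (property names are
 *  illustrative; a Map stands in for Hadoop's Configuration). */
public class CompressionConfig {

    enum CompressionType { NONE, RECORD, BLOCK }

    static CompressionType resolve(Map<String, String> conf, String output) {
        // Per-output override first, e.g. "parse.text.compression.type" ...
        String value = conf.get(output + ".compression.type");
        if (value == null) {
            // ... then the global default.
            value = conf.getOrDefault("io.seqfile.compression.type", "RECORD");
        }
        return CompressionType.valueOf(value.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("io.seqfile.compression.type", "BLOCK");
        conf.put("parse.text.compression.type", "record");

        System.out.println(resolve(conf, "parse.text"));  // RECORD (override)
        System.out.println(resolve(conf, "crawl.parse")); // BLOCK (default)
    }
}
```

With this shape, parse_text can stay on RECORD for fast random access while everything else picks up BLOCK from the global default, without any code change.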




[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]

Doug Cutting updated NUTCH-392:
-------------------------------

    Attachment: NUTCH-392.patch

Oops.  Attached the wrong patch.  Here's the right one.

