Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/10/07 18:45:00 UTC

[jira] [Commented] (NUTCH-2635) Generator writes unneeded temporary output

    [ https://issues.apache.org/jira/browse/NUTCH-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641168#comment-16641168 ] 

ASF GitHub Bot commented on NUTCH-2635:
---------------------------------------

sebastian-nagel closed pull request #376: NUTCH-2635 Generator writes unneeded temporary output
URL: https://github.com/apache/nutch/pull/376
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java
index a8512a3ff..da7e38ad8 100644
--- a/src/java/org/apache/nutch/crawl/Generator.java
+++ b/src/java/org/apache/nutch/crawl/Generator.java
@@ -514,7 +514,6 @@ public void reduce(FloatWritable key, Iterable<SelectorEntry> values,
 
           outputFile = generateFileName(entry);
           mos.write("sequenceFiles", key, entry, outputFile);
-          context.write(key,entry);
 
           // Count is incremented only when we keep the URL
           // maxCount may cause us to skip it.
@@ -572,8 +571,7 @@ public void reduce(Text key, Iterable<SelectorEntry> values,
         Context context)
         throws IOException, InterruptedException {
       // if using HashComparator, we get only one input key in case of
-      // hash collision
-      // so use only URLs from values
+      // hash collision so use only URLs from values
       for (SelectorEntry entry : values) {
         context.write(entry.url, entry.datum);
       }
@@ -605,8 +603,7 @@ public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
     private static int hash(byte[] bytes, int start, int length) {
       int hash = 1;
       // make later bytes more significant in hash code, so that sorting
-      // by
-      // hashcode correlates less with by-host ordering.
+      // by hashcode correlates less with by-host ordering.
       for (int i = length - 1; i >= 0; i--)
         hash = (31 * hash) + (int) bytes[start + i];
       return hash;
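
The comment on hash() in the last hunk says that later bytes are made more significant so that sorting by hash code correlates less with by-host ordering. A minimal standalone sketch of that idea follows. The hash() arithmetic is copied from the context lines of the diff; the class name, the forwardHash() variant and the sample URLs are hypothetical and only serve as a comparison, they are not part of Generator.java.

```java
import java.nio.charset.StandardCharsets;

public class ReverseHashSketch {

  // Same arithmetic as the hash() helper shown in the diff: bytes are
  // consumed from last to first, so the last byte is multiplied by the
  // highest power of 31 and the first byte by the lowest.
  static int hash(byte[] bytes, int start, int length) {
    int hash = 1;
    for (int i = length - 1; i >= 0; i--) {
      hash = (31 * hash) + (int) bytes[start + i];
    }
    return hash;
  }

  // Hypothetical forward variant (not in Nutch), shown only for contrast:
  // here the first byte ends up with the highest power of 31.
  static int forwardHash(byte[] bytes, int start, int length) {
    int hash = 1;
    for (int i = 0; i < length; i++) {
      hash = (31 * hash) + (int) bytes[start + i];
    }
    return hash;
  }

  public static void main(String[] args) {
    // Made-up sample URLs sharing one host prefix.
    String[] urls = {
      "http://example.org/a",
      "http://example.org/b",
      "http://example.org/c"
    };
    for (String url : urls) {
      byte[] b = url.getBytes(StandardCharsets.UTF_8);
      System.out.println(url
          + " reverse=" + hash(b, 0, b.length)
          + " forward=" + forwardHash(b, 0, b.length));
    }
  }
}
```

For the two-byte input "ab", hash() yields 4096 while the forward variant yields 4066: the same bytes are weighted in opposite order, which is what lets the trailing (path) bytes of a URL dominate the result instead of the leading (host) bytes.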


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Generator writes unneeded temporary output
> ------------------------------------------
>
>                 Key: NUTCH-2635
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2635
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> Generator writes the temporary output of the Selector job/step twice (see [line 516|https://github.com/apache/nutch/blob/branch-1.15/src/java/org/apache/nutch/crawl/Generator.java#L516]). This is not a big issue when generating small fetch lists, but it may become one when working on large data. The temporary output looks like:
> {noformat}
> % tree -h generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/
> |-- [4.0K]  fetchlist-1
> |   `-- [ 25M]  part-r-00000
> `-- [ 77M]  part-r-00000
> 1 directory, 2 files
> % file generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/part-r-00000 
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/part-r-00000: ASCII text
> % file generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/fetchlist-1/part-r-00000 
> generate-temp-fc27fe85-9ddc-4926-b6ba-dcd0066d5007/fetchlist-1/part-r-00000: Apache Hadoop Sequence file version 6
> {noformat}
> The unneeded output is plain text, which explains its larger size compared to the Hadoop sequence file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)