Posted to mapreduce-user@hadoop.apache.org by Kim Chew <kc...@gmail.com> on 2014/03/27 19:43:48 UTC

Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?

I have a simple M/R job that uses a Mapper only, thus no reducer. The mapper reads
a timestamp from the value, generates a path to the output file, and writes
the key and value to that output file.

The input file is a sequence file, not compressed, stored in HDFS;
it has a size of 162.68 MB.

Output also is written as a sequence file.

However, after I ran my job, I have two output part files from the mapper.
One has a size of 835.12 MB and the other has a size of 224.77 MB. So why
is the total output size so much larger? Shouldn't it be more or less
equal to the input's size of 162.68 MB, since I just write the key and value
passed to the mapper to the output?

Here is the mapper code snippet,

public void map(BytesWritable key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
    // bytesToInt is a helper that decodes an int at the given offset in the value bytes.
    long timestamp = bytesToInt(value.getBytes(), TIMESTAMP_INDEX);
    String tsStr = sdf.format(new Date(timestamp * 1000L)); // seconds to millis
    mos.write(key, value, generateFileName(tsStr)); // mos is a MultipleOutputs object
}

private String generateFileName(String key) {
    return outputDir + "/" + key + "/raw-vectors";
}
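
For completeness, the fields the snippet relies on (mos and sdf) would typically be
wired up in setup() and cleanup(), along these lines. This is a minimal sketch only;
the generic types and the date pattern are assumptions:

    // Requires: org.apache.hadoop.mapreduce.lib.output.MultipleOutputs,
    // java.text.SimpleDateFormat (plus the Hadoop types already used above).
    private MultipleOutputs<BytesWritable, BytesWritable> mos;
    private SimpleDateFormat sdf;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<BytesWritable, BytesWritable>(context);
        sdf = new SimpleDateFormat("yyyy-MM-dd"); // assumed pattern for the path segment
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flushes and closes all MultipleOutputs writers
    }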

And here are the job outputs,

14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format Counters
14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1111374798
14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes) snapshot=166428672
14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap usage (bytes)=38351872
14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1240104960
14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0

TIA,

Kim

Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?

Posted by Kim Chew <kc...@gmail.com>.
Yea, gonna do that. 8-)

Kim


On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <th...@bentzn.com> wrote:

> Have you checked the content of the files you write?
>
>
> /th
>
> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
> > I have a simple M/R job using Mapper only thus no reducer. The mapper
> > read a timestamp from the value, generate a path to the output file
> > and writes the key and value to the output file.
> >
> >
> > The input file is a sequence file, not compressed and stored in the
> > HDFS, it has a size of 162.68 MB.
> >
> >
> > Output also is written as a sequence file.
> >
> >
> >
> > However, after I ran my job, I have two output part files from the
> > mapper. One has a size of 835.12 MB and the other has a size of 224.77
> > MB. So why is the total outputs size is so much larger? Shouldn't it
> > be more or less equal to the input's size of 162.68MB since I just
> > write the key and value passed to mapper to the output?
> >
> >
> > Here is the mapper code snippet,
> >
> > public void map(BytesWritable key, BytesWritable value, Context
> > context) throws IOException, InterruptedException {
> >
> >         long timestamp = bytesToInt(value.getBytes(),
> > TIMESTAMP_INDEX);;
> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
> >
> >         mos.write(key, value, generateFileName(tsStr)); // mos is a
> > MultipleOutputs object.
> >     }
> >
> >         private String generateFileName(String key) {
> >         return outputDir+"/"+key+"/raw-vectors";
> >     }
> >
> >
> > And here are the job outputs,
> >
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
> > Counters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
> > 14/03/27 11:00:56 INFO mapred.JobClient:
> > HDFS_BYTES_WRITTEN=1111374798
> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
> > snapshot=166428672
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
> > usage (bytes)=38351872
> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
> > snapshot=1240104960
> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
> >
> >
> > TIA,
> >
> >
> > Kim
> >
>
>
>

Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?

Posted by Kim Chew <kc...@gmail.com>.
None of that.

I checked the input file's SequenceFile header and it says
"org.apache.hadoop.io.compress.zlib.BuiltInZlibDeflater"

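For reference, a minimal sketch of checking a SequenceFile's compression settings
programmatically, using the Hadoop 1.x SequenceFile.Reader API (the path argument
is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class InspectSeqFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // args[0] is the SequenceFile to inspect, e.g. the job's input file.
            SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(args[0]), conf);
            try {
                System.out.println("compressed:       " + reader.isCompressed());
                System.out.println("block compressed: " + reader.isBlockCompressed());
                if (reader.isCompressed()) {
                    System.out.println("codec:            "
                            + reader.getCompressionCodec().getClass().getName());
                }
            } finally {
                reader.close();
            }
        }
    }
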
Kim


On Fri, Mar 28, 2014 at 10:34 AM, Hardik Pandya <sm...@gmail.com> wrote:

> what is your compression format gzip, lzo or snappy
>
> for lzo final output
>
> FileOutputFormat.setCompressOutput(conf, true);
> FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);
>
> In addition, to make LZO splittable, you need to make a LZO index file.
>
>
> On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew <kc...@gmail.com> wrote:
>
>> Thanks folks.
>>
>> I am not awared my input data file has been compressed.
>> FileOutputFromat.setCompressOutput() is set to true when the file is
>> written. 8-(
>>
>> Kim
>>
>>
>> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead <mo...@gmail.com>wrote:
>>
>>> The following might answer you partially:
>>>
>>> Input key is not read from HDFS, it is auto generated as the offset of
>>> the input value in the input file. I think that is (partially) why read
>>> hdfs bytes is smaller than written hdfs bytes.
>>>  On Mar 27, 2014 1:34 PM, "Kim Chew" <kc...@gmail.com> wrote:
>>>
>>>> I am also wondering if, say, I have two identical timestamp so they are
>>>> going to be written to the same file. Does MulitpleOutputs handle appending?
>>>>
>>>> Thanks.
>>>>
>>>> Kim
>>>>
>>>>
>>>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <th...@bentzn.com> wrote:
>>>>
>>>>> Have you checked the content of the files you write?
>>>>>
>>>>>
>>>>> /th
>>>>>
>>>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>>>> > I have a simple M/R job using Mapper only thus no reducer. The mapper
>>>>> > read a timestamp from the value, generate a path to the output file
>>>>> > and writes the key and value to the output file.
>>>>> >
>>>>> >
>>>>> > The input file is a sequence file, not compressed and stored in the
>>>>> > HDFS, it has a size of 162.68 MB.
>>>>> >
>>>>> >
>>>>> > Output also is written as a sequence file.
>>>>> >
>>>>> >
>>>>> >
>>>>> > However, after I ran my job, I have two output part files from the
>>>>> > mapper. One has a size of 835.12 MB and the other has a size of
>>>>> 224.77
>>>>> > MB. So why is the total outputs size is so much larger? Shouldn't it
>>>>> > be more or less equal to the input's size of 162.68MB since I just
>>>>> > write the key and value passed to mapper to the output?
>>>>> >
>>>>> >
>>>>> > Here is the mapper code snippet,
>>>>> >
>>>>> > public void map(BytesWritable key, BytesWritable value, Context
>>>>> > context) throws IOException, InterruptedException {
>>>>> >
>>>>> >         long timestamp = bytesToInt(value.getBytes(),
>>>>> > TIMESTAMP_INDEX);;
>>>>> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
>>>>> >
>>>>> >         mos.write(key, value, generateFileName(tsStr)); // mos is a
>>>>> > MultipleOutputs object.
>>>>> >     }
>>>>> >
>>>>> >         private String generateFileName(String key) {
>>>>> >         return outputDir+"/"+key+"/raw-vectors";
>>>>> >     }
>>>>> >
>>>>> >
>>>>> > And here are the job outputs,
>>>>> >
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
>>>>> > Counters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:
>>>>> HDFS_BYTES_READ=171086386
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:
>>>>> > HDFS_BYTES_WRITTEN=1111374798
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
>>>>> > snapshot=166428672
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
>>>>> > usage (bytes)=38351872
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent
>>>>> (ms)=20080
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
>>>>> > snapshot=1240104960
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
>>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
>>>>> >
>>>>> >
>>>>> > TIA,
>>>>> >
>>>>> >
>>>>> > Kim
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?

Posted by Hardik Pandya <sm...@gmail.com>.
What is your compression format: gzip, LZO, or Snappy?

For LZO final output:

FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, LzoCodec.class);

In addition, to make LZO splittable, you need to build an LZO index file.
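
For SequenceFile output specifically, the compression type can also be set through
the SequenceFile-specific helper. A minimal sketch against the old mapred API, matching
the JobClient output in this thread (the JobConf variable and the choice of
DefaultCodec are assumptions):

    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    JobConf conf = new JobConf();
    // Compress the SequenceFile output block by block with the default (zlib) codec.
    SequenceFileOutputFormat.setCompressOutput(conf, true);
    SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
    SequenceFileOutputFormat.setOutputCompressorClass(conf, DefaultCodec.class);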


On Thu, Mar 27, 2014 at 8:57 PM, Kim Chew <kc...@gmail.com> wrote:

> Thanks folks.
>
> I am not awared my input data file has been compressed.
> FileOutputFromat.setCompressOutput() is set to true when the file is
> written. 8-(
>
> Kim
>
>
> On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead <mo...@gmail.com>wrote:
>
>> The following might answer you partially:
>>
>> Input key is not read from HDFS, it is auto generated as the offset of
>> the input value in the input file. I think that is (partially) why read
>> hdfs bytes is smaller than written hdfs bytes.
>>  On Mar 27, 2014 1:34 PM, "Kim Chew" <kc...@gmail.com> wrote:
>>
>>> I am also wondering if, say, I have two identical timestamp so they are
>>> going to be written to the same file. Does MulitpleOutputs handle appending?
>>>
>>> Thanks.
>>>
>>> Kim
>>>
>>>
>>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <th...@bentzn.com> wrote:
>>>
>>>> Have you checked the content of the files you write?
>>>>
>>>>
>>>> /th
>>>>
>>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>>> > I have a simple M/R job using Mapper only thus no reducer. The mapper
>>>> > read a timestamp from the value, generate a path to the output file
>>>> > and writes the key and value to the output file.
>>>> >
>>>> >
>>>> > The input file is a sequence file, not compressed and stored in the
>>>> > HDFS, it has a size of 162.68 MB.
>>>> >
>>>> >
>>>> > Output also is written as a sequence file.
>>>> >
>>>> >
>>>> >
>>>> > However, after I ran my job, I have two output part files from the
>>>> > mapper. One has a size of 835.12 MB and the other has a size of 224.77
>>>> > MB. So why is the total outputs size is so much larger? Shouldn't it
>>>> > be more or less equal to the input's size of 162.68MB since I just
>>>> > write the key and value passed to mapper to the output?
>>>> >
>>>> >
>>>> > Here is the mapper code snippet,
>>>> >
>>>> > public void map(BytesWritable key, BytesWritable value, Context
>>>> > context) throws IOException, InterruptedException {
>>>> >
>>>> >         long timestamp = bytesToInt(value.getBytes(),
>>>> > TIMESTAMP_INDEX);;
>>>> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
>>>> >
>>>> >         mos.write(key, value, generateFileName(tsStr)); // mos is a
>>>> > MultipleOutputs object.
>>>> >     }
>>>> >
>>>> >         private String generateFileName(String key) {
>>>> >         return outputDir+"/"+key+"/raw-vectors";
>>>> >     }
>>>> >
>>>> >
>>>> > And here are the job outputs,
>>>> >
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
>>>> > Counters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:
>>>> > HDFS_BYTES_WRITTEN=1111374798
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
>>>> > snapshot=166428672
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
>>>> > usage (bytes)=38351872
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
>>>> > snapshot=1240104960
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
>>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
>>>> >
>>>> >
>>>> > TIA,
>>>> >
>>>> >
>>>> > Kim
>>>> >
>>>>
>>>>
>>>>
>>>
>

Re: Why is HDFS_BYTES_WRITTEN much larger than HDFS_BYTES_READ in this case?

Posted by Kim Chew <kc...@gmail.com>.
Thanks folks.

I was not aware my input data file had been compressed.
FileOutputFormat.setCompressOutput() was set to true when the file was
written. 8-(
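
So the input SequenceFile was already deflate-compressed on disk, which is why the
decompressed output is so much larger. A one-line sketch of writing that input file
without compression instead, assuming the same FileOutputFormat-based writer job
(conf is that job's configuration):

    // Write the input SequenceFile uncompressed so input and output sizes are comparable.
    FileOutputFormat.setCompressOutput(conf, false);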

Kim


On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead <mo...@gmail.com> wrote:

> The following might answer you partially:
>
> Input key is not read from HDFS, it is auto generated as the offset of the
> input value in the input file. I think that is (partially) why read hdfs
> bytes is smaller than written hdfs bytes.
>  On Mar 27, 2014 1:34 PM, "Kim Chew" <kc...@gmail.com> wrote:
>
>> I am also wondering if, say, I have two identical timestamp so they are
>> going to be written to the same file. Does MulitpleOutputs handle appending?
>>
>> Thanks.
>>
>> Kim
>>
>>
>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <th...@bentzn.com> wrote:
>>
>>> Have you checked the content of the files you write?
>>>
>>>
>>> /th
>>>
>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>> > I have a simple M/R job using Mapper only thus no reducer. The mapper
>>> > read a timestamp from the value, generate a path to the output file
>>> > and writes the key and value to the output file.
>>> >
>>> >
>>> > The input file is a sequence file, not compressed and stored in the
>>> > HDFS, it has a size of 162.68 MB.
>>> >
>>> >
>>> > Output also is written as a sequence file.
>>> >
>>> >
>>> >
>>> > However, after I ran my job, I have two output part files from the
>>> > mapper. One has a size of 835.12 MB and the other has a size of 224.77
>>> > MB. So why is the total outputs size is so much larger? Shouldn't it
>>> > be more or less equal to the input's size of 162.68MB since I just
>>> > write the key and value passed to mapper to the output?
>>> >
>>> >
>>> > Here is the mapper code snippet,
>>> >
>>> > public void map(BytesWritable key, BytesWritable value, Context
>>> > context) throws IOException, InterruptedException {
>>> >
>>> >         long timestamp = bytesToInt(value.getBytes(),
>>> > TIMESTAMP_INDEX);;
>>> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
>>> >
>>> >         mos.write(key, value, generateFileName(tsStr)); // mos is a
>>> > MultipleOutputs object.
>>> >     }
>>> >
>>> >         private String generateFileName(String key) {
>>> >         return outputDir+"/"+key+"/raw-vectors";
>>> >     }
>>> >
>>> >
>>> > And here are the job outputs,
>>> >
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
>>> > Counters
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:
>>> > HDFS_BYTES_WRITTEN=1111374798
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
>>> > snapshot=166428672
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
>>> > usage (bytes)=38351872
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
>>> > snapshot=1240104960
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
>>> >
>>> >
>>> > TIA,
>>> >
>>> >
>>> > Kim
>>> >
>>>
>>>
>>>
>>

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

Posted by Kim Chew <kc...@gmail.com>.
Thanks folks.

I am not awared my input data file has been compressed.
FileOutputFromat.setCompressOutput() is set to true when the file is
written. 8-(

Kim


On Thu, Mar 27, 2014 at 5:46 PM, Mostafa Ead <mo...@gmail.com>wrote:

> The following might answer you partially:
>
> Input key is not read from HDFS, it is auto generated as the offset of the
> input value in the input file. I think that is (partially) why read hdfs
> bytes is smaller than written hdfs bytes.
>  On Mar 27, 2014 1:34 PM, "Kim Chew" <kc...@gmail.com> wrote:
>
>> I am also wondering if, say, I have two identical timestamp so they are
>> going to be written to the same file. Does MulitpleOutputs handle appending?
>>
>> Thanks.
>>
>> Kim
>>
>>
>> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <th...@bentzn.com> wrote:
>>
>>> Have you checked the content of the files you write?
>>>
>>>
>>> /th
>>>
>>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>>> > I have a simple M/R job using Mapper only thus no reducer. The mapper
>>> > read a timestamp from the value, generate a path to the output file
>>> > and writes the key and value to the output file.
>>> >
>>> >
>>> > The input file is a sequence file, not compressed and stored in the
>>> > HDFS, it has a size of 162.68 MB.
>>> >
>>> >
>>> > Output also is written as a sequence file.
>>> >
>>> >
>>> >
>>> > However, after I ran my job, I have two output part files from the
>>> > mapper. One has a size of 835.12 MB and the other has a size of 224.77
>>> > MB. So why is the total outputs size is so much larger? Shouldn't it
>>> > be more or less equal to the input's size of 162.68MB since I just
>>> > write the key and value passed to mapper to the output?
>>> >
>>> >
>>> > Here is the mapper code snippet,
>>> >
>>> > public void map(BytesWritable key, BytesWritable value, Context
>>> > context) throws IOException, InterruptedException {
>>> >
>>> >         long timestamp = bytesToInt(value.getBytes(),
>>> > TIMESTAMP_INDEX);;
>>> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
>>> >
>>> >         mos.write(key, value, generateFileName(tsStr)); // mos is a
>>> > MultipleOutputs object.
>>> >     }
>>> >
>>> >         private String generateFileName(String key) {
>>> >         return outputDir+"/"+key+"/raw-vectors";
>>> >     }
>>> >
>>> >
>>> > And here are the job outputs,
>>> >
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
>>> > Counters
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:
>>> > HDFS_BYTES_WRITTEN=1111374798
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
>>> > snapshot=166428672
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
>>> > usage (bytes)=38351872
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
>>> > snapshot=1240104960
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
>>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
>>> >
>>> >
>>> > TIA,
>>> >
>>> >
>>> > Kim
>>> >
>>>
>>>
>>>
>>

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

Posted by Mostafa Ead <mo...@gmail.com>.
The following might answer you partially:

The input key is not read from HDFS; it is auto-generated as the offset of
the input value in the input file. I think that is (partially) why the HDFS
bytes read are smaller than the HDFS bytes written.
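
A minimal sketch of what I mean, with made-up names. Note this is how
TextInputFormat behaves (the key is synthesized by the record reader, not
stored in the file); whether it applies to a sequence-file input like
yours, I am not sure:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OffsetKeyMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' was never stored in the input file; it is generated
        // while reading. Writing it back out adds bytes that were never
        // read from HDFS.
        context.write(offset, line);
    }
}
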
 On Mar 27, 2014 1:34 PM, "Kim Chew" <kc...@gmail.com> wrote:

> I am also wondering: if, say, I have two identical timestamps, they are
> going to be written to the same file. Does MultipleOutputs handle appending?
>
> Thanks.
>
> Kim
>
>
> On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <th...@bentzn.com> wrote:
>
>> Have you checked the content of the files you write?
>>
>>
>> /th
>>
>> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
>> > I have a simple M/R job using a Mapper only, thus no reducer. The mapper
>> > reads a timestamp from the value, generates a path to the output file,
>> > and writes the key and value to that file.
>> >
>> >
>> > The input file is a sequence file, not compressed and stored in the
>> > HDFS, it has a size of 162.68 MB.
>> >
>> >
>> > Output also is written as a sequence file.
>> >
>> >
>> >
>> > However, after I ran my job, I have two output part files from the
>> > mapper. One has a size of 835.12 MB and the other has a size of 224.77
>> > MB. So why is the total output size so much larger? Shouldn't it
>> > be more or less equal to the input's size of 162.68 MB, since I just
>> > write the key and value passed to the mapper to the output?
>> >
>> >
>> > Here is the mapper code snippet,
>> >
>> > public void map(BytesWritable key, BytesWritable value, Context
>> > context) throws IOException, InterruptedException {
>> >
>> >         long timestamp = bytesToInt(value.getBytes(),
>> > TIMESTAMP_INDEX);
>> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
>> >
>> >         mos.write(key, value, generateFileName(tsStr)); // mos is a
>> > MultipleOutputs object.
>> >     }
>> >
>> >         private String generateFileName(String key) {
>> >         return outputDir+"/"+key+"/raw-vectors";
>> >     }
>> >
>> >
>> > And here are the job outputs,
>> >
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
>> > Counters
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
>> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
>> > 14/03/27 11:00:56 INFO mapred.JobClient:
>> > HDFS_BYTES_WRITTEN=1111374798
>> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
>> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
>> > snapshot=166428672
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
>> > usage (bytes)=38351872
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
>> > snapshot=1240104960
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
>> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
>> >
>> >
>> > TIA,
>> >
>> >
>> > Kim
>> >
>>
>>
>>
>

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

Posted by Kim Chew <kc...@gmail.com>.
I am also wondering: if, say, I have two identical timestamps, they are
going to be written to the same file. Does MultipleOutputs handle appending?
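
For what it's worth, here is a minimal setup sketch (the class and field
names are made up). As far as I can tell, within a single task attempt
MultipleOutputs caches one RecordWriter per generated path and reuses it,
so two records with the same timestamp land in the same open file; it does
not append to files left over from an earlier job.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class RawVectorMapper
        extends Mapper<BytesWritable, BytesWritable, BytesWritable, BytesWritable> {

    private MultipleOutputs<BytesWritable, BytesWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<BytesWritable, BytesWritable>(context);
    }

    // map() as in the snippet above: mos.write(key, value, path)

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // flushes every per-path writer opened during the task
    }
}
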

Thanks.

Kim


On Thu, Mar 27, 2014 at 12:30 PM, Thomas Bentsen <th...@bentzn.com> wrote:

> Have you checked the content of the files you write?
>
>
> /th
>
> On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
> > I have a simple M/R job using a Mapper only, thus no reducer. The mapper
> > reads a timestamp from the value, generates a path to the output file,
> > and writes the key and value to that file.
> >
> >
> > The input file is a sequence file, not compressed and stored in the
> > HDFS, it has a size of 162.68 MB.
> >
> >
> > Output also is written as a sequence file.
> >
> >
> >
> > However, after I ran my job, I have two output part files from the
> > mapper. One has a size of 835.12 MB and the other has a size of 224.77
> > MB. So why is the total output size so much larger? Shouldn't it
> > be more or less equal to the input's size of 162.68 MB, since I just
> > write the key and value passed to the mapper to the output?
> >
> >
> > Here is the mapper code snippet,
> >
> > public void map(BytesWritable key, BytesWritable value, Context
> > context) throws IOException, InterruptedException {
> >
> >         long timestamp = bytesToInt(value.getBytes(),
> > TIMESTAMP_INDEX);
> >         String tsStr = sdf.format(new Date(timestamp * 1000L));
> >
> >         mos.write(key, value, generateFileName(tsStr)); // mos is a
> > MultipleOutputs object.
> >     }
> >
> >         private String generateFileName(String key) {
> >         return outputDir+"/"+key+"/raw-vectors";
> >     }
> >
> >
> > And here are the job outputs,
> >
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
> > 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
> > Counters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
> > 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
> > 14/03/27 11:00:56 INFO mapred.JobClient:
> > HDFS_BYTES_WRITTEN=1111374798
> > 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
> > 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
> > snapshot=166428672
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
> > usage (bytes)=38351872
> > 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
> > snapshot=1240104960
> > 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
> > 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
> >
> >
> > TIA,
> >
> >
> > Kim
> >
>
>
>

Re: Why is HDFS_BYTES_WRITTEN is much larger than HDFS_BYTES_READ in this case?

Posted by Thomas Bentsen <th...@bentzn.com>.
Have you checked the content of the files you write?
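
One quick way, as a sketch -- the path below is made up and error handling
is omitted. The sequence file header will tell you the record types and
whether the writer compressed the data:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path p = new Path("/user/kim/output/2014-03-27/raw-vectors-m-00000"); // hypothetical
SequenceFile.Reader reader = new SequenceFile.Reader(fs, p, conf);

// Print what the file actually contains, straight from its header.
System.out.println("key class:   " + reader.getKeyClassName());
System.out.println("value class: " + reader.getValueClassName());
System.out.println("compressed:  " + reader.isCompressed()
        + ", block compressed: " + reader.isBlockCompressed());

reader.close();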


/th

On Thu, 2014-03-27 at 11:43 -0700, Kim Chew wrote:
> I have a simple M/R job using a Mapper only, thus no reducer. The mapper
> reads a timestamp from the value, generates a path to the output file,
> and writes the key and value to that file.
> 
> 
> The input file is a sequence file, not compressed and stored in the
> HDFS, it has a size of 162.68 MB.
> 
> 
> Output also is written as a sequence file.
> 
> 
> 
> However, after I ran my job, I have two output part files from the
> mapper. One has a size of 835.12 MB and the other has a size of 224.77
> MB. So why is the total output size so much larger? Shouldn't it
> be more or less equal to the input's size of 162.68 MB, since I just
> write the key and value passed to the mapper to the output?
> 
> 
> Here is the mapper code snippet,
> 
> public void map(BytesWritable key, BytesWritable value, Context
> context) throws IOException, InterruptedException {
> 
>         long timestamp = bytesToInt(value.getBytes(),
> TIMESTAMP_INDEX);
>         String tsStr = sdf.format(new Date(timestamp * 1000L));
>         
>         mos.write(key, value, generateFileName(tsStr)); // mos is a
> MultipleOutputs object.
>     }
> 
>         private String generateFileName(String key) {
>         return outputDir+"/"+key+"/raw-vectors";
>     }
> 
> 
> And here are the job outputs,
> 
> 14/03/27 11:00:56 INFO mapred.JobClient:     Launched map tasks=2
> 14/03/27 11:00:56 INFO mapred.JobClient:     Data-local map tasks=2
> 14/03/27 11:00:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 14/03/27 11:00:56 INFO mapred.JobClient:   File Output Format
> Counters 
> 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Written=0
> 14/03/27 11:00:56 INFO mapred.JobClient:   FileSystemCounters
> 14/03/27 11:00:56 INFO mapred.JobClient:     HDFS_BYTES_READ=171086386
> 14/03/27 11:00:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=54272
> 14/03/27 11:00:56 INFO mapred.JobClient:
> HDFS_BYTES_WRITTEN=1111374798
> 14/03/27 11:00:56 INFO mapred.JobClient:   File Input Format Counters 
> 14/03/27 11:00:56 INFO mapred.JobClient:     Bytes Read=170782415
> 14/03/27 11:00:56 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/27 11:00:56 INFO mapred.JobClient:     Map input records=547
> 14/03/27 11:00:56 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=166428672
> 14/03/27 11:00:56 INFO mapred.JobClient:     Spilled Records=0
> 14/03/27 11:00:56 INFO mapred.JobClient:     Total committed heap
> usage (bytes)=38351872
> 14/03/27 11:00:56 INFO mapred.JobClient:     CPU time spent (ms)=20080
> 14/03/27 11:00:56 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=1240104960
> 14/03/27 11:00:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=286
> 14/03/27 11:00:56 INFO mapred.JobClient:     Map output records=0
> 
> 
> TIA,
> 
> 
> Kim
> 


