Posted to mapreduce-user@hadoop.apache.org by Some Body <so...@squareplanet.de> on 2010/07/08 15:44:59 UTC

SequenceFile as map input

To get around the small-file-problem (I have thousands of 2MB log files) I wrote
a class to convert all my log files into a single SequenceFile in
(Text key, BytesWritable value) format.  That works fine. I can run this:

    hadoop fs -text /my.seq |grep peemt114.log | head -1
    10/07/08 15:02:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    10/07/08 15:02:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    10/07/08 15:02:10 INFO compress.CodecPool: Got brand-new decompressor
    peemt114.log    70 65 65 6d 74 31 31 34 09 .........[snip].......

which shows my file-name key (peemt114.log) and the file-contents value,
which appears to be rendered as hex. The hex bytes up to the first tab
(09) translate to my hostname.
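
For context, a converter along these lines (a simplified, illustrative
sketch, not my exact class; the /logs input path is made up) does the job:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch: pack each small log file into one SequenceFile as
    // (Text fileName, BytesWritable fileContents)
    public class LogsToSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/my.seq"),
                Text.class, BytesWritable.class);
            try {
                for (FileStatus stat : fs.listStatus(new Path("/logs"))) {
                    byte[] buf = new byte[(int) stat.getLen()];
                    FSDataInputStream in = fs.open(stat.getPath());
                    try {
                        in.readFully(buf);  // ~2MB per file, fits in memory
                    } finally {
                        in.close();
                    }
                    writer.append(new Text(stat.getPath().getName()),
                                  new BytesWritable(buf));
                }
            } finally {
                writer.close();
            }
        }
    }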

I'm trying to adapt my mapper to use the SequenceFile as input.

I changed the job's inputFormatClass to:
    MyJob.setInputFormatClass(SequenceFileInputFormat.class);
and modified my mapper signature to:
   public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {

but how do I convert the value back to Text? When I print out the key/value pairs using:
        System.out.printf("MAPPER INKEY: [%s]\n", key);
        System.out.printf("MAPPER INVAL: [%s]\n", value.toString());
I get:
    MAPPER INKEY: [peemt114.log] 
    MAPPER INVAL: [70 65 65 6d 74 31 31 34 09 .....[snip]......]

Alan

Re: SequenceFile as map input

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Alan,

You don't need to do this complex trickery if you write <Object,Text> to the
SequenceFile.  How do you create the SequenceFile?  In your case it might
make sense to create a <Text,Text> SequenceFile where the first object is
the file name or complete path and the second is the content.

Then you just call:

    process_line(value.toString(), context);

without having to do the StringBuilder thing.
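
Put together, the mapper would then shrink to something like this (a
sketch; process_line is the helper from your earlier mail, and I've added
a trivial stand-in so the snippet compiles on its own):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // value holds a whole file as text; split it into lines
            for (String line : value.toString().split("\n")) {
                process_line(line, context);
            }
        }

        // trivial stand-in for the real helper, so the sketch compiles:
        // emits each line as a key with an empty value
        private void process_line(String line, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(line), new Text(""));
        }
    }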

Alex K


Re: SequenceFile as map input

Posted by Alan Miller <so...@squareplanet.de>.
Hi Alex,

My original files are ASCII text. I was using <Object, Text, Text, Text>
and everything worked fine.
Because my files are small (~2MB on avg.) I get one map task per file.
For my test I had 2000 files, totalling 5GB, and the whole run took
approx. 40 minutes.

I read that I could improve performance by merging my original files
into one big SequenceFile.

I did that, and that's why I'm trying to use <Object, BytesWritable,
Text, Text>.
My new SequenceFile is only 444MB, so my m/r job triggered 7 map tasks,
but apparently my new map() is computationally more intensive and the
whole run now takes 64 minutes.

In my map(Text key, BytesWritable value, Context context), value
contains the contents of a whole file. I tried to break it down into
line-based records which I send to reduce().

    StringBuilder line = new StringBuilder();
    char linefeed = '\n';
    // getBytes() can return a buffer longer than the valid data,
    // so only read up to getLength()
    byte[] bytes = value.getBytes();
    for (int i = 0; i < value.getLength(); i++) {
        byte byt = bytes[i];
        if ((int) byt == (int) linefeed) {
            line.append((char) byt);
            process_line(line.toString(), context);
            line.delete(0, line.length());
        } else {
            line.append((char) byt);
        }
    }
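
A simpler alternative might be to decode the value once and split it
(untested sketch, assuming the logs are plain UTF-8/ASCII):

    // Text.decode throws an IOException subclass, which map() already
    // declares, and getLength() keeps us inside the valid bytes
    String text = Text.decode(value.getBytes(), 0, value.getLength());
    for (String l : text.split("\n")) {
        process_line(l, context);
    }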

Alan



Re: SequenceFile as map input

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Alan,

Is the content of the original file ASCII text?  Then you should be using
the <Object, Text, Text, Text> signature.  By default 'hadoop fs -text ...'
will just call toString() on the object.  You get the object itself in the
map() method and can do whatever you want with it.  If Text or BytesWritable
does not work for you, you can always write your own class implementing the
Writable interface
<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html>.
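
A minimal custom Writable would look something like this (just a sketch;
the class name is illustrative and it simply delegates to Text):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Illustrative Writable that stores a file's contents as text
    public class LogFileWritable implements Writable {
        private final Text content = new Text();

        public void set(String s) { content.set(s); }
        public String get() { return content.toString(); }

        public void write(DataOutput out) throws IOException {
            content.write(out);
        }
        public void readFields(DataInput in) throws IOException {
            content.readFields(in);
        }
    }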

Let me know if you need more details on how to do this.

Alex K


Re: SequenceFile as map input

Posted by Alan Miller <so...@squareplanet.de>.
Hi Alex,

I'm not sure what you mean. I already set my mapper's signature to:
   public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {
      ...
      public void map(Text key, BytesWritable value, Context context) {
         ...
      }
   }

In my map() loop the content of value is the text from the original file,
and value.toString() returns a String of the bytes as hex pairs separated
by spaces. But I'd like the original tab-separated list of strings (i.e.
the lines in my original files).

I see BytesWritable.getBytes() returns a byte[]. I guess I could write my
own RecordReader to convert the byte[] back to text strings, but I thought
this is something the framework would provide.
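
Or maybe something as simple as copying the valid bytes into a Text
would do (untested sketch):

    // getBytes() may be padded, so copy only up to getLength()
    Text text = new Text();
    text.set(value.getBytes(), 0, value.getLength());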

Alan



Re: SequenceFile as map input

Posted by Alex Loddengaard <al...@cloudera.com>.
Hi Alan,

SequenceFiles keep track of the key and value type, so you should be able to
use the Writables in the signature.  Though it looks like you're using the
new API, and I admit that I'm not an expert with the new API.  Have you
tried using the Writables in the signature?
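
I.e., something like this (a sketch matching the (Text, BytesWritable)
pairs stored in the file; the map() body is just a placeholder):

    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<Text, BytesWritable, Text, Text> {
        @Override
        protected void map(Text key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            // key/value arrive as the concrete Writables from the file;
            // this placeholder just echoes the key and the payload size
            context.write(key, new Text(Integer.toString(value.getLength())));
        }
    }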

Alex
