Posted to mapreduce-dev@hadoop.apache.org by Pedro Costa <ps...@gmail.com> on 2012/04/03 17:01:38 UTC

Re: Reduce output is strange

If I want to compare two sequence files to see if they are the same, how do
I compare them?



On 19 December 2011 14:43, Robert Evans <ev...@yahoo-inc.com> wrote:

> Oh I forgot to say that part of the Random Characters are actually random
> characters.  Sequence files store a set of random characters as synch
> points within the file.  This allows for splitting the file easily without
> a high risk that the random sequence appears inside the data itself just by
> chance.
>
> --Bobby Evans
>
> On 12/19/11 7:51 AM, "Pedro Costa" <ps...@gmail.com> wrote:
>
> Hi,
>
> In Hadoop MapReduce, I've executed the webdatascan example, and the
> reduce output is in a SequenceFile. The result is shown here (
> http://paste.lisp.org/display/126572). What's the trash (random
> characters), like "u 265
> 0000100 330 320 252 " \n # ; 374 5 211 V ' 340 376" in the output? Is the
> output correct?
>
>
> 0000000   S   E   Q 006 031   o   r   g   .   a   p   a   c   h   e   .
> 0000020   h   a   d   o   o   p   .   i   o   .   T   e   x   t 031   o
> 0000040   r   g   .   a   p   a   c   h   e   .   h   a   d   o   o   p
> 0000060   .   i   o   .   T   e   x   t  \0  \0  \0  \0  \0  \0   u 265
> 0000100 330 320 252   "  \n   #   ; 374   5 211   V   ' 340 376  \0  \0
> 0000120  \0   X  \0  \0  \0     037   a   p   p   l   e       a   p   p
> 0000140   l   e       b   a   n   a   n   a       a   p   p   l   e
> 0000160   a   p   p   l   e       7   c   a   r   r   o   t       c   a
> 0000200   r   r   o   t       c   a   r   r   o   t       c   a   r   r
> 0000220   o   t       a   p   p   l   e       b   a   n   a   n   a
> 0000240   c   a   r   r   o   t       b   a   n   a   n   a
> 0000256
>
>
> --
> Thanks,
>
>


-- 
Best regards,

Re: Reduce output is strange

Posted by Owen O'Malley <om...@apache.org>.

On Tue, Apr 3, 2012 at 8:25 AM, Pedro Costa <ps...@gmail.com> wrote:
> What I want to ask is:
>
> - how do I read the values from sequence files that are block-compressed,
> record-compressed, or uncompressed?

You use the SequenceFile.Reader class.

> - how do I know if the sequence file is block compressed, record
> compressed, or uncompressed?

You use the SequenceFile.Reader class; its isCompressed() and
isBlockCompressed() methods report which mode the file uses.

>
> - how do I know if it's a sequence file or a Textfile?

SequenceFiles always start with the three bytes "SEQ" followed by a
one-byte version number, so the first 4 bytes identify the format.
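
As a rough illustration of that header check, here is a minimal Python
sketch that reads the start of a file and reports whether it looks like a
SequenceFile and, if so, how it is compressed. The function name
read_sequence_file_header is made up for this example, and the field layout
(one-byte version, two length-prefixed class names, then two one-byte
flags) is an assumption based on the version 6 header visible in the dump
quoted earlier in this thread. For real work, SequenceFile.Reader remains
the reliable route.

```python
import io


def read_sequence_file_header(stream):
    """Return header fields of a SequenceFile, or None if the magic is absent."""
    magic = stream.read(4)
    if len(magic) < 4 or magic[:3] != b"SEQ":
        return None  # no "SEQ" magic: probably a text file or something else
    version = magic[3]
    if version != 6:
        # Assumption: only the version 6 layout is handled in this sketch.
        raise ValueError("unsupported SequenceFile version: %d" % version)

    def read_short_string():
        # Assumption: the length prefix is a single byte (0..127), which is
        # enough for typical class names such as org.apache.hadoop.io.Text.
        prefix = stream.read(1)
        if len(prefix) != 1 or prefix[0] > 127:
            raise ValueError("unexpected length prefix")
        return stream.read(prefix[0]).decode("utf-8")

    key_class = read_short_string()
    value_class = read_short_string()
    flags = stream.read(2)
    if len(flags) != 2:
        raise ValueError("truncated header")
    compressed, block_compressed = flags[0] != 0, flags[1] != 0
    if not compressed:
        compression = "none"
    elif block_compressed:
        compression = "block"
    else:
        compression = "record"
    return {"version": version, "key_class": key_class,
            "value_class": value_class, "compression": compression}


# Header bytes rebuilt by hand from the octal dump in this thread.
sample = b"SEQ\x06" + b"\x19org.apache.hadoop.io.Text" * 2 + b"\x00\x00"
print(read_sequence_file_header(io.BytesIO(sample)))
```

Any binary file object works as input, for example open(path, "rb") on a
local copy of the file. A None result means the "SEQ" magic is missing.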

-- Owen

Re: Reduce output is strange

Posted by Pedro Costa <ps...@gmail.com>.

What I want to ask is:

- how do I read the values from sequence files that are block-compressed,
record-compressed, or uncompressed?

- how do I know if the sequence file is block compressed, record
compressed, or uncompressed?

- how do I know if it's a sequence file or a Textfile?





-- 
Best regards,

Re: Reduce output is strange

Posted by Owen O'Malley <om...@apache.org>.

On Tue, Apr 3, 2012 at 8:01 AM, Pedro Costa <ps...@gmail.com> wrote:
> If I want to compare 2 sequence files to see if they are the same, how do I
> compare?

From the command line, you can "textify" the files with:

hadoop fs -text myfile.seq

Of course, if you are using the API, you can iterate through the two
SequenceFiles and compare them row by row.
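
One way to script the command-line route is sketched below in Python: it
textifies both files with "hadoop fs -text" and walks the two outputs row
by row until they differ. The helper names textify and first_difference
are invented for this example, and the sketch assumes the hadoop launcher
is on the PATH and that each output fits in memory. It compares only the
textual rendering of the records, so it says nothing about header details
such as compression mode or sync markers.

```python
import subprocess
from itertools import zip_longest


def textify(path):
    """Return the lines printed by `hadoop fs -text <path>`."""
    # Assumption: `hadoop` is on the PATH and can read `path`.
    proc = subprocess.run(
        ["hadoop", "fs", "-text", path],
        check=True, capture_output=True, text=True,
    )
    return proc.stdout.splitlines()


def first_difference(rows_a, rows_b):
    """Return (index, row_a, row_b) for the first mismatch, or None if equal.

    A side that ran out of rows is reported as None in the tuple.
    """
    missing = object()  # sentinel marking the end of the shorter input
    pairs = zip_longest(rows_a, rows_b, fillvalue=missing)
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return (index,
                    None if a is missing else a,
                    None if b is missing else b)
    return None


if __name__ == "__main__":
    import sys
    # Usage: python compare_seq.py first.seq second.seq
    diff = first_difference(textify(sys.argv[1]), textify(sys.argv[2]))
    print("files match" if diff is None else "first mismatch: %r" % (diff,))
```

Because first_difference accepts any two iterables of rows, the same
function could be fed records read through the API instead of text lines.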

-- Owen