Posted to user@hadoop.apache.org by Adam Retter <ad...@googlemail.com> on 2014/01/24 19:05:52 UTC

Memory problems with BytesWritable and huge binary files

Hi there,

We have several diverse large datasets to process (one set may be as
much as 27 TB); however, all of the files in these datasets are binary
files. We need to be able to pass each binary file to several tools
running in the MapReduce framework.
We already have a working pipeline of MapReduce tasks that receives
each binary file (as BytesWritable) and processes it; so far we have
only tested it with very small test datasets.

For any particular dataset, the size of the files involved varies
wildly, with each file being anywhere between about 2 KB and 4 GB. With
that in mind we have tried to follow the advice to read the files into
a Sequence File in HDFS. To create the Sequence File we have a
MapReduce job that uses a SequenceFileOutputFormat[Text, BytesWritable].
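For reference, the driver for that job looks roughly like the following
(simplified; WholeFileInputFormat is just a stand-in name for our own
non-splitting input format, so treat this as illustrative rather than
our exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class PackBinariesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pack-binaries");
    job.setJarByClass(PackBinariesDriver.class);
    // WholeFileInputFormat: our own format that reads each whole file into
    // a single BytesWritable value - which is exactly where the memory goes.
    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);            // key: original file name
    job.setOutputValueClass(BytesWritable.class); // value: entire file content
    FileInputFormat.addInputPath(job, new Path(args[0]));
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}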

We cannot split these files into chunks; they must be processed by our
tools in our mappers and reducers as complete files. The problem we
have is that BytesWritable appears to load the entire content of a
file into memory, and now that we are trying to process our
production-size datasets, once you get a couple of large files on the
go, the JVM throws the dreaded OutOfMemoryError.

What we need is some way to process these binary files by reading and
writing their contents as streams to and from the Sequence File, or
really any other mechanism that does not involve loading the entire
file into RAM! Our own tools that we use in the mappers and reducers
in fact expect to work with java.io.InputStream. We have tried quite a
few things now, including writing some custom Writable
implementations, but we then end up buffering data in temporary files,
which is not exactly ideal when the data already exists in the
sequence files in HDFS.

Is there any hope?


Thanks Adam.

-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Re: Memory problems with BytesWritable and huge binary files

Posted by Harsh J <ha...@cloudera.com>.
Hi,

The Writable interface is not necessary - you can eliminate it and
interact with a file at the input-stream level, which is what Vinod
was suggesting in his reply. Check out the serialisations section in
Tom White's Hadoop: The Definitive Guide, in the chapter "Hadoop I/O",
which discusses this at a general level. When you control the
RecordReader over a file (not a sequence file, as you'd be limited by
its abilities and interfaces there) you get access to the raw input
stream, so you can use that to your advantage.
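
As a rough, untested sketch of what I mean (the class name is only a
placeholder, not a real Hadoop class): because map input values are
created locally by the RecordReader and never pass through Writable
serialisation, a non-splittable input format can hand your mapper the
open stream itself instead of a materialised byte array, e.g.:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class StreamingFileInputFormat
    extends FileInputFormat<Text, FSDataInputStream> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // each file must be processed whole, never chunked
  }

  @Override
  public RecordReader<Text, FSDataInputStream> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<Text, FSDataInputStream>() {
      private FileSplit fileSplit;
      private TaskAttemptContext ctx;
      private FSDataInputStream stream;
      private boolean consumed = false;

      @Override
      public void initialize(InputSplit s, TaskAttemptContext c) {
        fileSplit = (FileSplit) s;
        ctx = c;
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (consumed) {
          return false;
        }
        Path path = fileSplit.getPath();
        FileSystem fs = path.getFileSystem(ctx.getConfiguration());
        stream = fs.open(path);  // raw HDFS stream, nothing buffered in full
        consumed = true;
        return true;
      }

      @Override
      public Text getCurrentKey() {
        return new Text(fileSplit.getPath().getName());
      }

      @Override
      public FSDataInputStream getCurrentValue() {
        return stream;
      }

      @Override
      public float getProgress() {
        return consumed ? 1.0f : 0.0f;
      }

      @Override
      public void close() throws IOException {
        if (stream != null) {
          stream.close();
        }
      }
    };
  }
}

Your mapper then receives (file name, open stream) pairs and can feed
the stream straight to anything that wants a java.io.InputStream
(FSDataInputStream is one), so the file content itself never has to
sit in memory.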

On Sat, Jan 25, 2014 at 5:40 AM, Adam Retter <ad...@googlemail.com> wrote:
> So I am not sure I follow you, as we already have a custom InputFormat
> and RecordReader and that does not seem to help.
>
> The reason it does not seem to help is that it needs to return the
> data as a Writable so that the Writable can then be used in the
> following map operation. The map operation needs access to the entire
> file.
>
> The only way to do this in Hadoop by default is to use BytesWritable,
> but that places everything in memory.
>
> What am I missing?
>
> On 24 January 2014 22:42, Vinod Kumar Vavilapalli
> <vi...@hortonworks.com> wrote:
>> Okay. Assuming you don't need a whole file (video) in memory for your processing, you can simply write an InputFormat/RecordReader implementation that streams through any given file and processes it.
>>
>> +Vinod
>>
>> On Jan 24, 2014, at 12:44 PM, Adam Retter <ad...@googlemail.com> wrote:
>>
>>>> Is your data in any given file a bunch of key-value pairs?
>>>
>>> No. The content of each file itself is the value we are interested in,
>>> and I guess that its filename is the key.
>>>
>>>> If that isn't the
>>>> case, I'm wondering how writing a single large key-value into a sequence
>>>> file helps. It won't. Maybe you can give an example of your input data?
>>>
>>> Well from the Hadoop O'Reilly book, I rather got the impression that
>>> HDFS does not like small files due to its 64MB block size, and it is
>>> instead recommended to place small files into a Sequence file. Is that
>>> not the case?
>>>
>>> Our input data really varies between 130 different file types, it
>>> could be Microsoft Office documents, Video Recordings, Audio, CAD
>>> diagrams etc.
>>>
>>>> If indeed they are a bunch of smaller sized key-value pairs, you can write
>>>> your own custom InputFormat that reads the data from your input files one
>>>> k-v pair after another, and feed it to your MR job. There isn't any need for
>>>> converting them to sequence-files at that point.
>>>
>>> As I mentioned in my initial email, each file cannot be split up!
>>>
>>>> Thanks
>>>> +Vinod
>>>> Hortonworks Inc.
>>>> http://hortonworks.com/
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Adam Retter
>>>
>>> skype: adam.retter
>>> tweet: adamretter
>>> http://www.adamretter.org.uk
>>
>>
>
>
>
> --
> Adam Retter
>
> skype: adam.retter
> tweet: adamretter
> http://www.adamretter.org.uk



-- 
Harsh J

Re: Memory problems with BytesWritable and huge binary files

Posted by Adam Retter <ad...@googlemail.com>.
So I am not sure I follow you, as we already have a custom InputFormat
and RecordReader and that does not seem to help.

The reason it does not seem to help is that it needs to return the
data as a Writable so that the Writable can then be used in the
following map operation. The map operation needs access to the entire
file.

The only way to do this in Hadoop by default is to use BytesWritable,
but that places everything in memory.

What am I missing?

On 24 January 2014 22:42, Vinod Kumar Vavilapalli
<vi...@hortonworks.com> wrote:
>> Okay. Assuming you don't need a whole file (video) in memory for your processing, you can simply write an InputFormat/RecordReader implementation that streams through any given file and processes it.
>
> +Vinod
>
> On Jan 24, 2014, at 12:44 PM, Adam Retter <ad...@googlemail.com> wrote:
>
>>> Is your data in any given file a bunch of key-value pairs?
>>
>> No. The content of each file itself is the value we are interested in,
>> and I guess that its filename is the key.
>>
>>> If that isn't the
>>> case, I'm wondering how writing a single large key-value into a sequence
>>> file helps. It won't. Maybe you can give an example of your input data?
>>
>> Well from the Hadoop O'Reilly book, I rather got the impression that
>> HDFS does not like small files due to its 64MB block size, and it is
>> instead recommended to place small files into a Sequence file. Is that
>> not the case?
>>
>> Our input data really varies between 130 different file types, it
>> could be Microsoft Office documents, Video Recordings, Audio, CAD
>> diagrams etc.
>>
>>> If indeed they are a bunch of smaller sized key-value pairs, you can write
>>> your own custom InputFormat that reads the data from your input files one
>>> k-v pair after another, and feed it to your MR job. There isn't any need for
>>> converting them to sequence-files at that point.
>>
>> As I mentioned in my initial email, each file cannot be split up!
>>
>>> Thanks
>>> +Vinod
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>>
>>
>>
>>
>> --
>> Adam Retter
>>
>> skype: adam.retter
>> tweet: adamretter
>> http://www.adamretter.org.uk
>
>



-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Re: Memory problems with BytesWritable and huge binary files

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Okay. Assuming you don't need a whole file (video) in memory for your processing, you can simply write an InputFormat/RecordReader implementation that streams through any given file and processes it.
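
For example (purely illustrative - the input format here is assumed to
deliver (file name, open stream) pairs, and ExistingTool is just a
placeholder for whatever InputStream-based tool you run), the map side
could then be as simple as:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StreamProcessingMapper
    extends Mapper<Text, FSDataInputStream, Text, Text> {

  @Override
  protected void map(Text fileName, FSDataInputStream stream, Context context)
      throws IOException, InterruptedException {
    // The tool reads from the stream; the whole file is never held in memory.
    String result = ExistingTool.process(stream);
    context.write(fileName, new Text(result));
  }
}

The framework never tries to serialise the stream, because map input
values only ever live inside the task that created them.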

+Vinod

On Jan 24, 2014, at 12:44 PM, Adam Retter <ad...@googlemail.com> wrote:

>> Is your data in any given file a bunch of key-value pairs?
> 
> No. The content of each file itself is the value we are interested in,
> and I guess that its filename is the key.
> 
>> If that isn't the
>> case, I'm wondering how writing a single large key-value into a sequence
>> file helps. It won't. Maybe you can give an example of your input data?
> 
> Well from the Hadoop O'Reilly book, I rather got the impression that
> HDFS does not like small files due to its 64MB block size, and it is
> instead recommended to place small files into a Sequence file. Is that
> not the case?
> 
> Our input data really varies between 130 different file types, it
> could be Microsoft Office documents, Video Recordings, Audio, CAD
> diagrams etc.
> 
>> If indeed they are a bunch of smaller sized key-value pairs, you can write
>> your own custom InputFormat that reads the data from your input files one
>> k-v pair after another, and feed it to your MR job. There isn't any need for
>> converting them to sequence-files at that point.
> 
> As I mentioned in my initial email, each file cannot be split up!
> 
>> Thanks
>> +Vinod
>> Hortonworks Inc.
>> http://hortonworks.com/
>> 
>> 
> 
> 
> 
> -- 
> Adam Retter
> 
> skype: adam.retter
> tweet: adamretter
> http://www.adamretter.org.uk


Re: Memory problems with BytesWritable and huge binary files

Posted by Adam Retter <ad...@googlemail.com>.
> Is your data in any given file a bunch of key-value pairs?

No. The content of each file itself is the value we are interested in,
and I guess that its filename is the key.

> If that isn't the
> case, I'm wondering how writing a single large key-value into a sequence
> file helps. It won't. Maybe you can give an example of your input data?

Well, from the Hadoop O'Reilly book, I rather got the impression that
HDFS does not like small files due to its 64MB block size, and it is
instead recommended to place small files into a Sequence file. Is that
not the case?
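
As I understand it, that recommendation amounts to something roughly
like the following (illustrative only, not our actual code): each small
file becomes one (file name, file bytes) record in the Sequence File.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path(args[0])),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (int i = 1; i < args.length; i++) {
        Path small = new Path(args[i]);
        byte[] content = new byte[(int) fs.getFileStatus(small).getLen()];
        FSDataInputStream in = fs.open(small);
        try {
          IOUtils.readFully(in, content, 0, content.length);
        } finally {
          in.close();
        }
        // One record per file: key = file name, value = entire file content.
        writer.append(new Text(small.getName()), new BytesWritable(content));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

That is fine for genuinely small files, but it is exactly that
whole-file byte array / BytesWritable that blows up for us once
individual files reach hundreds of MB or several GB.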

Our input data really varies across 130 different file types; it
could be Microsoft Office documents, Video Recordings, Audio, CAD
diagrams, etc.

> If indeed they are a bunch of smaller sized key-value pairs, you can write
> your own custom InputFormat that reads the data from your input files one
> k-v pair after another, and feed it to your MR job. There isn't any need for
> converting them to sequence-files at that point.

As I mentioned in my initial email, each file cannot be split up!

> Thanks
> +Vinod
> Hortonworks Inc.
> http://hortonworks.com/
>
>



-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Re: Memory problems with BytesWritable and huge binary files

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Is your data in any given file a bunch of key-value pairs? If that isn't
the case, I'm wondering how writing a single large key-value into a
sequence file helps. It won't. Maybe you can give an example of your input
data?

If indeed they are a bunch of smaller sized key-value pairs, you can write
your own custom InputFormat that reads the data from your input files one
k-v pair after another, and feed it to your MR job. There isn't any need
for converting them to sequence-files at that point.

Thanks
+Vinod
Hortonworks Inc.
http://hortonworks.com/
