Posted to common-user@hadoop.apache.org by Rahul Bhattacharjee <ra...@gmail.com> on 2013/04/01 05:32:54 UTC

Re: Streaming value of (200MB) from a SequenceFile

Hi Sandy,

I am also new to Hadoop and have a question here.
The Writable does get a DataInput stream, so the object can be
constructed from the byte stream.
Are you suggesting saving the stream for later use? Later, though, we
cannot ascertain the state of the stream.
For a large value, I think we can take just the useful part and emit it
from the mapper; we might also use a custom input format to do this so
that the large value doesn't even reach the mapper.
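
Something like the following is what I had in mind for taking just the
useful part. It is only a rough sketch: it assumes the value was serialized
the way BytesWritable writes it (a 4-byte length followed by the raw bytes),
and the class name and HEADER_SIZE are made up for illustration. The point
is that readFields() keeps only a prefix and discards the rest, so the full
value is never held in memory, and a custom input format could plug this
class in as the value type.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class HeaderOnlyWritable implements Writable {
  // Hypothetical size of the "useful part" of the value.
  public static final int HEADER_SIZE = 4096;

  private byte[] header = new byte[0];

  @Override
  public void readFields(DataInput in) throws IOException {
    int total = in.readInt();                 // full length of the value
    int keep = Math.min(total, HEADER_SIZE);
    header = new byte[keep];
    in.readFully(header);                     // keep only the prefix

    // Read and discard the remainder in small chunks instead of
    // materializing the whole value.
    byte[] scratch = new byte[64 * 1024];
    long remaining = (long) total - keep;
    while (remaining > 0) {
      int chunk = (int) Math.min(scratch.length, remaining);
      in.readFully(scratch, 0, chunk);
      remaining -= chunk;
    }
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(header.length);
    out.write(header);
  }

  public byte[] getHeader() {
    return header;
  }
}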

Am I missing anything here?

Thanks,
Rahul



On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <ch...@gmail.com> wrote:

> Hi everyone,
>
> I'm having a problem streaming individual key-value pairs of 200MB to 1GB
> from a MapFile.
> I need to stream the large value to an OutputStream instead of reading the
> entire value before processing, because it potentially uses too much memory.
>
> I read the API for MapFile; next(WritableComparable key, Writable val)
> does not return an input stream.
>
> How can I accomplish this?
>
> Thanks,
>
> Jerry
>

Re: Streaming value of (200MB) from a SequenceFile

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Sandy for the excellent explanation. I didn't think about the loss of
data locality.

Regards,
Rahul


On Mon, Apr 1, 2013 at 11:29 AM, Sandy Ryza <sa...@cloudera.com> wrote:

> Hi Rahul,
>
> I don't think saving the stream for later use would work - I was just
> suggesting that if only some aggregate statistics needed to be calculated,
> they could be calculated at read time instead of in the mapper.  Nothing
> requires a Writable to contain all the data that it reads.
>
> That's a good point that you can pass the locations of the files.  A
> drawback of this is that Hadoop attempts to co-locate mappers with where
> their input data is stored, and this approach would negate the locality
> advantage.
>
> 200 MB is not too small a file for Hadoop.  A typical HDFS block size is
> 64 MB or 128 MB, so a file that's larger than that is not unreasonable.
>
> -Sandy
>
>
> On Sun, Mar 31, 2013 at 8:56 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Sorry for the multiple replies.
>>
>> There is one more thing that can be done (I guess) for streaming the
>> values rather than constructing the whole object: we can store the value
>> in HDFS as a file and use its location as the mapper's value. The mapper
>> can then open a stream using the specified location.
>>
>> Not sure if a 200 MB file would qualify as a small file w.r.t. Hadoop, or
>> if too many 200 MB files would have any impact on the NN.
>>
>> Thanks,
>> Rahul
>>
>>
>>
>> On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> Hi Sandy,
>>>
>>> I am also new to Hadoop and have a question here.
>>> The Writable does get a DataInput stream, so the object can be
>>> constructed from the byte stream.
>>> Are you suggesting saving the stream for later use? Later, though, we
>>> cannot ascertain the state of the stream.
>>> For a large value, I think we can take just the useful part and emit it
>>> from the mapper; we might also use a custom input format to do this so
>>> that the large value doesn't even reach the mapper.
>>>
>>> Am I missing anything here?
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>>
>>> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <ch...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm having a problem streaming individual key-value pairs of 200MB to
>>>> 1GB from a MapFile.
>>>> I need to stream the large value to an OutputStream instead of reading
>>>> the entire value before processing, because it potentially uses too much
>>>> memory.
>>>>
>>>> I read the API for MapFile; next(WritableComparable key, Writable
>>>> val) does not return an input stream.
>>>>
>>>> How can I accomplish this?
>>>>
>>>> Thanks,
>>>>
>>>> Jerry
>>>>
>>>
>>>
>>
>

Re: Streaming value of (200MB) from a SequenceFile

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Rahul,

I don't think saving the stream for later use would work - I was just
suggesting that if only some aggregate statistics needed to be calculated,
they could be calculated at read time instead of in the mapper.  Nothing
requires a Writable to contain all the data that it reads.
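
As a rough illustration of what I mean (not real library code; it assumes
the value was written BytesWritable-style, i.e. a length followed by raw
bytes, and the class name is made up), a Writable can compute its aggregates
while it reads and throw the bytes away:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class AggregatingWritable implements Writable {
  private long length;    // total size of the value
  private long byteSum;   // toy aggregate computed at read time

  @Override
  public void readFields(DataInput in) throws IOException {
    length = in.readInt();
    byteSum = 0;
    byte[] buf = new byte[64 * 1024];
    long remaining = length;
    while (remaining > 0) {
      int chunk = (int) Math.min(buf.length, remaining);
      in.readFully(buf, 0, chunk);
      for (int i = 0; i < chunk; i++) {
        byteSum += buf[i] & 0xff;             // aggregate, don't store
      }
      remaining -= chunk;
    }
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // This sketch is read-only; a real implementation would re-serialize.
    throw new UnsupportedOperationException("read-only sketch");
  }

  public long getLength() { return length; }
  public long getByteSum() { return byteSum; }
}

The mapper would then see only the length and the aggregate, never the
value itself.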

That's a good point that you can pass the locations of the files.  A
drawback of this is that Hadoop attempts to co-locate mappers with where
their input data is stored, and this approach would negate the locality
advantage.

200 MB is not too small a file for Hadoop.  A typical HDFS block size is 64
MB or 128 MB, so a file that's larger than that is not unreasonable.
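For example, with a 128 MB block size a 200 MB file is just two blocks (128
MB plus 72 MB), which is nothing like the many-tiny-files situation that
puts pressure on the NameNode.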

-Sandy

On Sun, Mar 31, 2013 at 8:56 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Sorry for the multiple replies.
>
> There is one more thing that can be done (I guess) for streaming the
> values rather than constructing the whole object: we can store the value
> in HDFS as a file and use its location as the mapper's value. The mapper
> can then open a stream using the specified location.
>
> Not sure if a 200 MB file would qualify as a small file w.r.t. Hadoop, or
> if too many 200 MB files would have any impact on the NN.
>
> Thanks,
> Rahul
>
>
>
> On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> Hi Sandy,
>>
>> I am also new to Hadoop and have a question here.
>> The Writable does get a DataInput stream, so the object can be
>> constructed from the byte stream.
>> Are you suggesting saving the stream for later use? Later, though, we
>> cannot ascertain the state of the stream.
>> For a large value, I think we can take just the useful part and emit it
>> from the mapper; we might also use a custom input format to do this so
>> that the large value doesn't even reach the mapper.
>>
>> Am I missing anything here?
>>
>> Thanks,
>> Rahul
>>
>>
>>
>> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <ch...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm having a problem streaming individual key-value pairs of 200MB to 1GB
>>> from a MapFile.
>>> I need to stream the large value to an OutputStream instead of reading
>>> the entire value before processing, because it potentially uses too much
>>> memory.
>>>
>>> I read the API for MapFile; next(WritableComparable key, Writable
>>> val) does not return an input stream.
>>>
>>> How can I accomplish this?
>>>
>>> Thanks,
>>>
>>> Jerry
>>>
>>
>>
>

Re: Streaming value of (200MB) from a SequenceFile

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Sorry for the multiple replies.

There is one more thing that can be done (I guess) for streaming the values
rather than constructing the whole object: we can store the value in HDFS as
a file and use its location as the mapper's value. The mapper can then open
a stream using the specified location.
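
Something along these lines is what I was thinking. Only a sketch (the
class name is made up, and it assumes the SequenceFile values are Text
paths pointing at the real data in HDFS):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LargeValueMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text key, Text valueLocation, Context context)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    FSDataInputStream in = fs.open(new Path(valueLocation.toString()));
    try {
      byte[] buf = new byte[64 * 1024];
      int read;
      while ((read = in.read(buf)) > 0) {
        // Process the value incrementally here instead of buffering it.
      }
    } finally {
      IOUtils.closeStream(in);
    }
  }
}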

Not sure if a 200 MB file would qualify as a small file w.r.t. Hadoop, or if
too many 200 MB files would have any impact on the NN.

Thanks,
Rahul



On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com
> wrote:

> Hi Sandy,
>
> I am also new to Hadoop and have a question here.
> The Writable does get a DataInput stream, so the object can be
> constructed from the byte stream.
> Are you suggesting saving the stream for later use? Later, though, we
> cannot ascertain the state of the stream.
> For a large value, I think we can take just the useful part and emit it
> from the mapper; we might also use a custom input format to do this so
> that the large value doesn't even reach the mapper.
>
> Am I missing anything here?
>
> Thanks,
> Rahul
>
>
>
> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <ch...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I'm having a problem streaming individual key-value pairs of 200MB to 1GB
>> from a MapFile.
>> I need to stream the large value to an OutputStream instead of reading
>> the entire value before processing, because it potentially uses too much
>> memory.
>>
>> I read the API for MapFile; next(WritableComparable key, Writable
>> val) does not return an input stream.
>>
>> How can I accomplish this?
>>
>> Thanks,
>>
>> Jerry
>>
>
>
