Posted to general@hadoop.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2010/05/05 17:29:43 UTC

Hadoop Data Sharing

Hi everyone, I have recently started to play around with Hadoop, but I am
running into some "design" problems.
I need to run the same job several times in a loop and, in each iteration,
get the processed values back (without using a file, because I would then
need to read it back in). I was using a static vector in my main class (the
one that iterates and launches the job in each iteration) to retrieve those
values, and it worked while I was running in standalone mode. Now I have
tried it in pseudo-distributed mode and it obviously does not work.
Any suggestions, please?

Thanks in advance,


Renato M.

Re: Hadoop Data Sharing

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Thanks for your replies. Yeah, I had to restructure part of my code, but it
is all good now.
Thanks again for your suggestions.

Renato M.

Re: Hadoop Data Sharing

Posted by Jay Booth <ja...@gmail.com>.
Probably the most direct route to your desired result is to save the
objects to either a SequenceFile or a plain text file on DFS.  Then, in
the configure() method of your mappers/reducers, you open the file on DFS,
stream its contents into a local variable and refer to it as you need to.
Either way, you'll need some sort of serialization, either via Writable or
as plain text.
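
As a rough sketch of what this could look like with the old
org.apache.hadoop.mapred API (the class name and the "myapp.shared.path"
property are made up for illustration, assuming a plain text file with one
value per line):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedDataMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Values produced by an earlier pass, loaded once per map task.
  private List<String> sharedValues = new ArrayList<String>();

  public void configure(JobConf job) {
    // The driver sets this property to the DFS path of the shared file.
    String shared = job.get("myapp.shared.path");
    try {
      FileSystem fs = FileSystem.get(job);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path(shared))));
      String line;
      while ((line = in.readLine()) != null) {
        sharedValues.add(line);  // one value per line, plain text
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load shared data", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... consult sharedValues while processing each input record ...
  }
}

Each map task reads the file once in configure(), which is usually fine as
long as the shared data is small.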

Re: Hadoop Data Sharing

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi Aaron,

The thing is that I have a data structure that is saved into a vector, and
this vector needs to be available to my MapReduce jobs while iterating. So
do you think serializing these objects would be a good and easy way to do
it? It's a vector in which each node contains another user-defined data
structure. Maybe I will try doing it with plain files first and see how the
throughput goes.
Hey, do you know where I can find some examples of serializing objects for
Hadoop so I can save them into SequenceFiles?
Thanks in advance.

Renato M.


Re: Hadoop Data Sharing

Posted by Aaron Kimball <aa...@cloudera.com>.
Perhaps this is guidance in the area you were hoping for: If your data is in
objects that implement the interface 'Writable', then you can use the
SequenceFileOutputFormat and SequenceFileInputFormat to store your
intermediate data in binary form in disk-backed files called SequenceFiles.
The serialization will happen through the write() and readFields() methods
of your objects, which will automatically be called by the
OutputFormat/InputFormat as they move through the system. So your subsequent
MR pass will receive objects back in the same form as they were emitted.
This is a considerably better idea (from both a throughput and a sanity
perspective) in a chained MapReduce job.
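
For example, a bare-bones Writable along these lines (NodeData and its
fields are placeholders for whatever your vector elements actually hold),
plus the driver calls that switch the job over to SequenceFiles:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class NodeData implements Writable {
  private String label = "";
  private double weight;

  // Called by the OutputFormat when the object is serialized.
  public void write(DataOutput out) throws IOException {
    out.writeUTF(label);
    out.writeDouble(weight);
  }

  // Called by the InputFormat; must read fields in the same order.
  public void readFields(DataInput in) throws IOException {
    label = in.readUTF();
    weight = in.readDouble();
  }
}

// In the driver, roughly:
//   conf.setInputFormat(SequenceFileInputFormat.class);
//   conf.setOutputFormat(SequenceFileOutputFormat.class);
//   conf.setOutputKeyClass(Text.class);
//   conf.setOutputValueClass(NodeData.class);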

- Aaron

Re: Hadoop Data Sharing

Posted by Aaron Kimball <aa...@cloudera.com>.
What objects are you referring to? I'm not sure I understand your question.
- Aaron

Re: Hadoop Data Sharing

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Thanks Aaron! I was thinking the same after doing some reading.
Man, what about serializing the objects? Do you think that would be a good
idea?
Thanks again.

Renato M.


Re: Hadoop Data Sharing

Posted by Aaron Kimball <aa...@cloudera.com>.
Renato,

In general if you need to perform a multi-pass MapReduce workflow, each pass
materializes its output to files. The subsequent pass then reads those same
files back in as input. This allows the workflow to start at the last
"checkpoint" if it gets interrupted. There is no persistent in-memory
distributed storage feature in Hadoop that would allow a MapReduce job to
post results to memory for consumption by a subsequent job.

So you would just read your initial data from /input, and write your interim
results to /iteration0. Then the next pass reads from /iteration0 and writes
to /iteration1, etc..
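
A stripped-down driver loop along those lines (paths, the fixed iteration
count, and the class name are only for illustration; setting the mapper,
reducer and key/value classes is omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    int iterations = 5;        // or loop until a convergence test passes
    String input = "/input";
    for (int i = 0; i < iterations; i++) {
      JobConf conf = new JobConf(IterativeDriver.class);
      conf.setJobName("iteration-" + i);
      // ... set mapper, reducer and key/value classes here ...
      FileInputFormat.setInputPaths(conf, new Path(input));
      FileOutputFormat.setOutputPath(conf, new Path("/iteration" + i));
      JobClient.runJob(conf);    // blocks until this pass finishes
      input = "/iteration" + i;  // the next pass reads this pass's output
    }
  }
}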

If your data is reasonably small and you think it could fit in memory
somewhere, then you could experiment with using other distributed key-value
stores (memcached, HBase, Cassandra, etc.) to hold intermediate results.
But this will require some integration work on your part.
- Aaron
