Posted to common-user@hadoop.apache.org by S D <sd...@gmail.com> on 2009/02/14 23:46:01 UTC

Race Condition?

In my Hadoop 0.19.0 program each map function is assigned a directory
(representing a data location in my S3 datastore). The first thing each map
function does is copy the particular S3 data to the local machine that the
map task is running on and then begin processing the data; e.g.,

command = "hadoop dfs -copyToLocal #{s3dir} #{localdir}"
system "#{command}"

In the above, "s3dir" is a directory that creates "localdir" - my
expectation is that "localdir" is created in the work directory for the
particular task attempt. Following this copy command I then run a function
that processes the data; e.g.,

processData(localdir)

In some instances my map/reduce program crashes and when I examine the logs
I get a message saying that "localdir" cannot be found. This confuses me
since the hadoop shell command above is blocking so that localdir should
exist by the time processData() is called. I've found that if I add in some
diagnostic lines prior to processData() such as puts statements to print out
variables, I never run into the problem of the localdir not being found. It
is almost as if localdir needs time to be created before the call to
processData().
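
One way to rule out a silent copy failure is to check the shell call's
exit status before processing anything; a minimal Ruby sketch (the raise
message is illustrative, not part of the original program):

    command = "hadoop dfs -copyToLocal #{s3dir} #{localdir}"
    ok = system(command)   # system returns true on exit status 0, false otherwise
    raise "copyToLocal failed with status #{$?.exitstatus}" unless ok
    processData(localdir)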

Has anyone encountered anything like this? Any suggestions on what could be
wrong are appreciated.

Thanks,
John

Re: Race Condition?

Posted by S D <sd...@gmail.com>.
I'm having difficulty capturing the output of any of the dfs commands
(either in Ruby or on the command line). Supposedly the output is sent to
stdout, yet running any of the commands on the command line displays
nothing, and neither does redirecting to a file (e.g., hadoop dfs
-copyToLocal src dest > out.txt). I'm not sure what I'm missing here...
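
A likely explanation (worth verifying) is that the hadoop shell sends its
messages to standard error rather than standard output, so redirecting
stdout alone captures nothing. Merging stderr into the capture, as a
sketch:

    output = `hadoop dfs -copyToLocal #{s3dir} #{localdir} 2>&1`
    puts "output: #{output}"                # 2>&1 folds stderr into the captured text
    puts "exit status: #{$?.exitstatus}"    # status of the last spawned command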

John

On Sun, Feb 15, 2009 at 11:28 PM, Matei Zaharia <ma...@cloudera.com> wrote:

> I would capture the output of the dfs -copyToLocal command, because I still
> think that is the most likely cause of the data not making it. I don't know
> how to capture this output in Ruby but I'm sure it's possible. You want to
> capture both standard out and standard error.
> One other slim possibility is that if your localdir is a fixed absolute
> path, multiple map tasks on the machine may be trying to access it
> concurrently, and maybe one of them deletes it when it's done and one
> doesn't. Normally each task should run in its own temp directory though.

Re: Race Condition?

Posted by Matei Zaharia <ma...@cloudera.com>.
I would capture the output of the dfs -copyToLocal command, because I still
think that is the most likely cause of the data not making it. I don't know
how to capture this output in Ruby but I'm sure it's possible. You want to
capture both standard out and standard error.
One other slim possibility is that if your localdir is a fixed absolute
path, multiple map tasks on the machine may be trying to access it
concurrently, and maybe one of them deletes it when it's done and one
doesn't. Normally each task should run in its own temp directory though.
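
Capturing both streams separately from Ruby can be done with the open3
standard library; a sketch:

    require 'open3'

    Open3.popen3("hadoop dfs -copyToLocal #{s3dir} #{localdir}") do |stdin, stdout, stderr|
      stdin.close                    # no input to send
      puts "stdout: #{stdout.read}"
      puts "stderr: #{stderr.read}"
    end

And if a shared fixed path turns out to be the problem, deriving a
per-attempt directory name avoids the collision; assuming a streaming-style
job that exposes the attempt id to the script as the mapred_task_id
environment variable (worth confirming for your setup):

    attempt = ENV['mapred_task_id'] || Process.pid.to_s  # fall back to the pid
    localdir = "data_#{attempt}"                         # unique per task attempt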

On Sun, Feb 15, 2009 at 2:51 PM, S D <sd...@gmail.com> wrote:

> I was not able to determine the command shell return value for
>
>     hadoop dfs -copyToLocal #{s3dir} #{localdir}
>
> but I did print out several variables after the call and determined that
> the
> call apparently did not go through successfully. In particular, prior to my
> processData(localdir) command I use Ruby's puts to print out the contents
> of
> several directories including 'localdir' and '../localdir' - here is the
> weird thing: if I execute the following
>     list = `ls -l "#{localdir}"`
>     puts "List: #{list}"
> (where 'localdir' is the directory I need as an arg for processData) the
> processData command will execute properly. At first I thought that running
> the puts command was allowing enough time to elapse for a race condition to
> be avoided so that 'localdir' was ready when the processData command was
> called (I know that in certain ways that doesn't make sense given that
> hadoop dfs -copyToLocal blocks until it completes...) but then I tried
> other
> time-consuming commands such as
>     list = `ls -l "../#{localdir}"`
>     puts "List: #{list}"
> and running processData(localdir) led to an error:
>     'localdir' not found
>
> Any clues on what could be going on?
>
> Thanks,
> John

Re: Race Condition?

Posted by S D <sd...@gmail.com>.
I was not able to determine the command shell return value for

     hadoop dfs -copyToLocal #{s3dir} #{localdir}

but I did print out several variables after the call and determined that the
call apparently did not go through successfully. In particular, prior to my
processData(localdir) command I use Ruby's puts to print out the contents of
several directories including 'localdir' and '../localdir' - here is the
weird thing: if I execute the following
     list = `ls -l "#{localdir}"`
     puts "List: #{list}"
(where 'localdir' is the directory I need as an arg for processData) the
processData command will execute properly. At first I thought that running
the puts command was allowing enough time to elapse for a race condition to
be avoided so that 'localdir' was ready when the processData command was
called (I know that in certain ways that doesn't make sense given that
hadoop dfs -copyToLocal blocks until it completes...) but then I tried other
time-consuming commands such as
     list = `ls -l "../#{localdir}"`
     puts "List: #{list}"
and running processData(localdir) led to an error:
     'localdir' not found

Any clues on what could be going on?
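
One more data point that might help: logging the task's working directory
and the absolute path being checked, since behavior that changes when an
unrelated shell call is added often points at a relative-path or
working-directory problem. A small diagnostic sketch:

    puts "cwd: #{Dir.pwd}"
    puts "absolute path: #{File.expand_path(localdir)}"
    puts "directory exists? #{File.directory?(localdir)}"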

Thanks,
John



On Sat, Feb 14, 2009 at 6:45 PM, Matei Zaharia <ma...@cloudera.com> wrote:

> Have you logged the output of the dfs command to see whether the copy has
> always succeeded?

Re: Race Condition?

Posted by Matei Zaharia <ma...@cloudera.com>.
Have you logged the output of the dfs command to see whether the copy has
always succeeded?
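
For instance (a sketch; with a streaming job, anything the script writes to
stderr should end up in the task logs):

    output = `#{command} 2>&1`   # command is the copyToLocal string from the original post
    STDERR.puts "copyToLocal output: #{output}" unless $?.success?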

On Sat, Feb 14, 2009 at 2:46 PM, S D <sd...@gmail.com> wrote:

> In my Hadoop 0.19.0 program each map function is assigned a directory
> (representing a data location in my S3 datastore). The first thing each map
> function does is copy the particular S3 data to the local machine that the
> map task is running on and then begin processing the data; e.g.,
>
> command = "hadoop dfs -copyToLocal #{s3dir} #{localdir}"
> system "#{command}"
>
> In the above, copying "s3dir" creates "localdir" - my
> expectation is that "localdir" is created in the work directory for the
> particular task attempt. Following this copy command I then run a function
> that processes the data; e.g.,
>
> processData(localdir)
>
> In some instances my map/reduce program crashes and when I examine the logs
> I get a message saying that "localdir" cannot be found. This confuses me
> since the hadoop shell command above is blocking so that localdir should
> exist by the time processData() is called. I've found that if I add in some
> diagnostic lines prior to processData() such as puts statements to print
> out
> variables, I never run into the problem of the localdir not being found. It
> is almost as if localdir needs time to be created before the call to
> processData().
>
> Has anyone encountered anything like this? Any suggestions on what could be
> wrong are appreciated.
>
> Thanks,
> John
>