Posted to mapreduce-dev@hadoop.apache.org by saurabh jain <sa...@gmail.com> on 2014/08/13 03:34:07 UTC

Re: Synchronization among Mappers in map-reduce task

Hi Wangda,

I am not sure that making overwrite=false will solve the problem. As per the
Javadoc, with overwrite=false the create will throw an exception if the file
already exists, so all the remaining mappers will fail with an exception.

Also, I am very new to ZK and have only a basic knowledge of it, so I am not
sure whether it can solve the problem and, if so, how. I am still going
through the available resources on ZK.

Can you please point me to some resource or link on ZK that can help me
solve the problem?

Best
Saurabh

On Tue, Aug 12, 2014 at 3:08 AM, Wangda Tan <wh...@gmail.com> wrote:

> Hi Saurabh,
> It's an interesting topic.
>
> >> So, here is the question: is it possible to make sure that when one of
> >> the mapper tasks is writing to a file, the others wait until the first
> >> one is finished? I read that the mapper tasks don't interact with each
> >> other.
>
> A simple way to do this is to use the HDFS namespace:
> Create the file using "public FSDataOutputStream create(Path f, boolean
> overwrite)" with overwrite=false. Only one mapper can successfully create
> the file.
>
> After the write completes, that mapper creates a flag file such as
> "completed" in the same folder. The other mappers can wait for the
> "completed" file to be created.
>
> >> Is there any way to have synchronization between two independent
> >> map-reduce jobs?
> I think ZK can do some complex synchronization here, like mutex, master
> election, etc.
>
> Hope this helps,
>
> Wangda Tan
>
>
>
>
> On Tue, Aug 12, 2014 at 10:43 AM, saurabh jain <sa...@gmail.com>
> wrote:
>
> > Hi Folks,
> >
> > I have been writing a map-reduce application that takes an input file
> > containing records, where every field in a record is separated by some
> > delimiter.
> >
> > In addition, the user provides a list of columns to look up in a master
> > properties file (stored in HDFS). If such a column (let's call it a key)
> > is present in the master properties file, the code fetches the
> > corresponding value and updates the key with it; if the key is not
> > present in the master properties file, the code creates a new value for
> > the key, writes it to the properties file, and also updates the record.
> >
> > I have written this application and tested it, and everything worked
> > fine until now.
> >
> > *e.g.:* *I/P Record:* This | is | the | test | record
> >
> > *Columns:* 2,4 (meaning the code will look up only the fields *"is"* and
> > *"test"* in the master properties file.)
> >
> > Here I have a question.
> >
> > *Q 1:* When my input file is huge and is split across multiple mappers,
> > I was getting the exception below, where all the other mapper tasks were
> > failing. *Also, initially, when I started the job, my master properties
> > file was empty.* In my code I check whether this file (master
> > properties) exists and, if not, create a new empty file before
> > submitting the job itself.
> >
> > e.g.: If I have 4 splits of data, then 3 map tasks fail. But after this,
> > all the failed map tasks restart, and finally the job becomes
> > successful.
> >
> > So, *here is the question: is it possible to make sure that when one of
> > the mapper tasks is writing to a file, the others wait until the first
> > one is finished?* I read that the mapper tasks don't interact with each
> > other.
> >
> > Also, what will happen in the scenario where I start multiple parallel
> > map-reduce jobs, all of them working on the same properties file? *Is
> > there any way to have synchronization between two independent map-reduce
> > jobs*?
> >
> > I have also read that ZooKeeper can be used in such scenarios; is that
> > correct?
> >
> >
> > Error: com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException:
> > IOException - failed while appending data to the file ->Failed to create
> > file [/user/cloudera/lob/master/bank.properties] for
> > [DFSClient_attempt_1407778869492_0032_m_000002_0_1618418105_1] on client
> > [10.X.X.17], because this file is already being created by
> > [DFSClient_attempt_1407778869492_0032_m_000005_0_-949968337_1] on
> > [10.X.X.17]
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
> >
> >
>

Re: Synchronization among Mappers in map-reduce task

Posted by Wangda Tan <wh...@gmail.com>.
Hi Saurabh,

>> I am not sure that making overwrite=false will solve the problem. As per
>> the Javadoc, with overwrite=false the create will throw an exception if
>> the file already exists, so all the remaining mappers will fail with an
>> exception.
You can catch the exception and wait.
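For illustration, here is a minimal sketch of that catch-and-wait pattern
(the lock and flag paths are made up, and since the losing create surfaces
as an IOException on HDFS, the catch is kept broad):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateLock {

    // Hypothetical paths, for illustration only.
    private static final Path LOCK = new Path("/user/cloudera/lob/master/.lock");
    private static final Path DONE = new Path("/user/cloudera/lob/master/.completed");

    // Returns true if this task won the race to create the lock file.
    static boolean tryAcquire(FileSystem fs) throws IOException {
        try {
            fs.create(LOCK, false).close(); // overwrite=false: only one creator succeeds
            return true;
        } catch (IOException alreadyBeingCreated) {
            return false; // another mapper created it first
        }
    }

    // Losing tasks poll for the flag file the winner writes when it is done.
    static void waitForCompletion(FileSystem fs) throws IOException, InterruptedException {
        while (!fs.exists(DONE)) {
            Thread.sleep(1000L);
        }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        if (tryAcquire(fs)) {
            // ... update the master properties file here ...
            fs.create(DONE, false).close(); // signal the waiting mappers
        } else {
            waitForCompletion(fs);
        }
    }
}

The usual caveat applies: if the winner dies before writing the flag file,
the waiters spin forever, so real code should add a timeout or some other
recovery path.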

>> Can you please point me to some resource or link on ZK that can help me
>> solve the problem?
You can check this: http://zookeeper.apache.org/doc/r3.4.6/recipes.html
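To give a feel for what a ZK mutex looks like, here is a minimal, non-fair
sketch in the spirit of the simple lock on that page (the znode path and
session setup are assumed; the full recipe there uses sequential ephemeral
znodes to avoid the herd effect, which this sketch does not):

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkMutex {

    private final ZooKeeper zk;
    private final String lockPath; // e.g. "/locks/bank-properties" (made-up path)

    ZkMutex(ZooKeeper zk, String lockPath) {
        this.zk = zk;
        this.lockPath = lockPath;
    }

    // Block until we own the lock znode.
    void acquire() throws KeeperException, InterruptedException {
        while (true) {
            try {
                // Ephemeral: the lock vanishes automatically if this client dies.
                zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.EPHEMERAL);
                return; // we own the lock
            } catch (KeeperException.NodeExistsException held) {
                CountDownLatch changed = new CountDownLatch(1);
                // Watch the holder's znode; any event on it wakes us to retry.
                if (zk.exists(lockPath, event -> changed.countDown()) != null) {
                    changed.await();
                }
            }
        }
    }

    void release() throws KeeperException, InterruptedException {
        zk.delete(lockPath, -1); // -1 = any version
    }
}

In practice you would more likely reach for Apache Curator, whose
InterProcessMutex implements the full recipe for you.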

Thanks,
Wangda



On Wed, Aug 13, 2014 at 9:34 AM, saurabh jain <sa...@gmail.com> wrote:

> Hi Wangda,
>
> I am not sure that making overwrite=false will solve the problem. As per
> the Javadoc, with overwrite=false the create will throw an exception if
> the file already exists, so all the remaining mappers will fail with an
> exception.
>
> Also, I am very new to ZK and have only a basic knowledge of it, so I am
> not sure whether it can solve the problem and, if so, how. I am still
> going through the available resources on ZK.
>
> Can you please point me to some resource or link on ZK that can help me
> solve the problem?
>
> Best
> Saurabh
>
> On Tue, Aug 12, 2014 at 3:08 AM, Wangda Tan <wh...@gmail.com> wrote:
>
>> Hi Saurabh,
>> It's an interesting topic.
>>
>> >> So, here is the question: is it possible to make sure that when one of
>> >> the mapper tasks is writing to a file, the others wait until the first
>> >> one is finished? I read that the mapper tasks don't interact with each
>> >> other.
>>
>> A simple way to do this is to use the HDFS namespace:
>> Create the file using "public FSDataOutputStream create(Path f, boolean
>> overwrite)" with overwrite=false. Only one mapper can successfully create
>> the file.
>>
>> After the write completes, that mapper creates a flag file such as
>> "completed" in the same folder. The other mappers can wait for the
>> "completed" file to be created.
>>
>> >> Is there any way to have synchronization between two independent
>> >> map-reduce jobs?
>> I think ZK can do some complex synchronization here, like mutex, master
>> election, etc.
>>
>> Hope this helps,
>>
>> Wangda Tan
>>
>>
>>
>>
>> On Tue, Aug 12, 2014 at 10:43 AM, saurabh jain <sa...@gmail.com>
>> wrote:
>>
>> > Hi Folks,
>> >
>> > I have been writing a map-reduce application that takes an input file
>> > containing records, where every field in a record is separated by some
>> > delimiter.
>> >
>> > In addition, the user provides a list of columns to look up in a master
>> > properties file (stored in HDFS). If such a column (let's call it a key)
>> > is present in the master properties file, the code fetches the
>> > corresponding value and updates the key with it; if the key is not
>> > present in the master properties file, the code creates a new value for
>> > the key, writes it to the properties file, and also updates the record.
>> >
>> > I have written this application and tested it, and everything worked
>> > fine until now.
>> >
>> > *e.g.:* *I/P Record:* This | is | the | test | record
>> >
>> > *Columns:* 2,4 (meaning the code will look up only the fields *"is"* and
>> > *"test"* in the master properties file.)
>> >
>> > Here I have a question.
>> >
>> > *Q 1:* When my input file is huge and is split across multiple mappers,
>> > I was getting the exception below, where all the other mapper tasks were
>> > failing. *Also, initially, when I started the job, my master properties
>> > file was empty.* In my code I check whether this file (master
>> > properties) exists and, if not, create a new empty file before
>> > submitting the job itself.
>> >
>> > e.g.: If I have 4 splits of data, then 3 map tasks fail. But after this,
>> > all the failed map tasks restart, and finally the job becomes
>> > successful.
>> >
>> > So, *here is the question: is it possible to make sure that when one of
>> > the mapper tasks is writing to a file, the others wait until the first
>> > one is finished?* I read that the mapper tasks don't interact with each
>> > other.
>> >
>> > Also, what will happen in the scenario where I start multiple parallel
>> > map-reduce jobs, all of them working on the same properties file? *Is
>> > there any way to have synchronization between two independent map-reduce
>> > jobs*?
>> >
>> > I have also read that ZooKeeper can be used in such scenarios; is that
>> > correct?
>> >
>> >
>> > Error: com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException:
>> > IOException - failed while appending data to the file ->Failed to create
>> > file [/user/cloudera/lob/master/bank.properties] for
>> > [DFSClient_attempt_1407778869492_0032_m_000002_0_1618418105_1] on client
>> > [10.X.X.17], because this file is already being created by
>> > [DFSClient_attempt_1407778869492_0032_m_000005_0_-949968337_1] on
>> > [10.X.X.17]
>> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
>> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
>> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
>> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
>> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
>> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
>> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
>> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
>> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
>> >     at java.security.AccessController.doPrivileged(Native Method)
>> >     at javax.security.auth.Subject.doAs(Subject.java:415)
>> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
>> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
>> >
>> >
>>
>
>
