Posted to mapreduce-dev@hadoop.apache.org by saurabh jain <sa...@gmail.com> on 2014/08/12 04:43:06 UTC

Synchronization among Mappers in map-reduce task

Hi Folks,

I have been writing a map-reduce application whose input file contains
records, with every field in a record separated by a delimiter.

In addition, the user provides a list of columns to look up in a master
properties file (stored in HDFS). If a column value (let's call it a key)
is present in the master properties file, the code fetches the
corresponding value and replaces the key with it in the record. If the key
is not present in the master properties file, the code creates a new value
for the key, writes the new entry to the properties file, and updates the
record as well.

I have written and tested this application, and everything worked fine
until now.

*e.g.:* *I/P Record:* This | is | the | test | record

*Columns:* 2,4 (that means the code will look up only the fields *"is"* and
*"test"* in the master properties file.)
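
Just for illustration, here is a rough sketch of the per-record logic (the
class name, the 1-based column indexing, and the new-value strategy are
placeholders, not my actual code):

import java.util.Properties;

public class RecordLookup {

    static String process(String record, int[] columns, Properties master) {
        String[] fields = record.split("\\s*\\|\\s*"); // fields separated by " | "
        for (int col : columns) {
            String key = fields[col - 1];              // 1-based: 2,4 -> "is", "test"
            String value = master.getProperty(key);
            if (value == null) {
                value = newValueFor(key);              // hypothetical value generator
                master.setProperty(key, value);        // also written back to the HDFS file
            }
            fields[col - 1] = value;                   // update the record with the value
        }
        return String.join(" | ", fields);
    }

    static String newValueFor(String key) {
        return "V_" + key;                             // placeholder strategy
    }
}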

Here, I have a question.

*Q 1:* When my input file is huge and is split across multiple mappers, I
was getting the exception below, where all the other mapper tasks were
failing. *Also, when I initially started the job, my master properties file
was empty.* In my code I check whether this file (master properties) exists
and, if it doesn't, create a new empty file before submitting the job
itself.

e.g.: If I have 4 splits of data, then 3 map tasks fail. But after this,
all the failed map tasks restart and the job finally becomes successful.

So, *here is the question: is it possible to make sure that while one of
the mapper tasks is writing to a file, the others wait until the first one
is finished?* I have read that mapper tasks don't interact with each
other.

Also, what will happen in the scenario where I start multiple map-reduce
jobs in parallel and all of them work on the same properties file? *Is
there any way to have synchronization between two independent map-reduce
jobs*?

I have also read that ZooKeeper can be used in such scenarios. Is that
correct?


Error: com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException:
IOException - failed while appending data to the file -> Failed to
create file [/user/cloudera/lob/master/bank.properties] for
[DFSClient_attempt_1407778869492_0032_m_000002_0_1618418105_1] on
client [10.X.X.17], because this file is already being created by
[DFSClient_attempt_1407778869492_0032_m_000005_0_-949968337_1] on
[10.X.X.17]
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
                at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
                at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
                at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
                at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
                at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:415)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
                at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)

Re: Synchronization among Mappers in map-reduce task

Posted by Wangda Tan <wh...@gmail.com>.
Hi Saurabh,
It's an interesting topic.

>> So, here is the question: is it possible to make sure that while one of
the mapper tasks is writing to a file, the others wait until the first one
is finished? I have read that mapper tasks don't interact with each other.

A simple way to do this is to use the HDFS namespace itself: create the
file with "public FSDataOutputStream create(Path f, boolean overwrite)"
and overwrite=false. Only one mapper can successfully create the file.

After its write completes, that mapper creates a flag file such as
"completed" in the same folder. The other mappers can wait until the
"completed" file appears.
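
A minimal sketch of that idea (the paths, the poll interval, and the class
name are illustrative assumptions, not tested code):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsNamespaceLock {

    public static void writeOrWait(Configuration conf)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(conf);
        Path propsFile = new Path("/user/cloudera/lob/master/bank.properties"); // assumed path
        Path doneFlag  = new Path("/user/cloudera/lob/master/completed");       // assumed path

        try {
            // overwrite=false: exactly one mapper succeeds in creating the file.
            FSDataOutputStream out = fs.create(propsFile, false);
            try {
                // ... this task won the race: write the updated properties here ...
            } finally {
                out.close();
            }
            // Signal the other mappers that the write is finished.
            fs.create(doneFlag, false).close();
        } catch (IOException e) {
            // Simplification: treat the failure as "another mapper is writing".
            // A real implementation should check for FileAlreadyExistsException.
            while (!fs.exists(doneFlag)) {
                Thread.sleep(1000L); // poll until the flag file appears
            }
        }
    }
}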

>> Is there any way to have synchronization between two independent
map-reduce jobs?
I think ZK can do some complex synchronization here, like mutexes, master
election, etc.
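
For example, a mutex shared by two independent jobs could look roughly
like this with the Apache Curator recipes for ZK (just one possible
approach; the connect string and lock path are made-up values):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CrossJobMutex {

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zkhost:2181",                         // made-up ZK quorum address
                new ExponentialBackoffRetry(1000, 3)); // 1s base sleep, 3 retries
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/locks/master-properties");
        lock.acquire();            // blocks until no other job holds the mutex
        try {
            // ... read/update the shared properties file ...
        } finally {
            lock.release();
        }
        client.close();
    }
}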

Hope this helps,

Wangda Tan



