Posted to user@hadoop.apache.org by parnab kumar <pa...@gmail.com> on 2013/06/14 16:06:03 UTC

How to design the mapper and reducer for the following problem

An input file where each line corresponds to a document. Each document is identified by a set of fingerprints (hashes). For example, a line in the input file is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The output of the MapReduce job should be the pairs of DOCIDs that share at least a threshold number of hashes in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5

RE: How to design the mapper and reducer for the following problem

Posted by John Lilley <jo...@redpoint.net>.
On further thought, it would be simpler to augment Reducer1 to use disk when the tuples for a given HASH do not fit into memory.  Nested looping over the disk file is sequential and will be fast.  Then you can avoid the distributed join.
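A minimal sketch of this disk-backed variant (illustrative class name, temp-file handling, and (pair, 1) output format; the real Reducer1 may differ): spill every DOCID in the group to a local temp file, then stream that file sequentially for both the outer and inner passes of the nested loop.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DiskBackedPairReducer extends Reducer<Text, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void reduce(Text hash, Iterable<Text> docIds, Context ctx)
      throws IOException, InterruptedException {
    // Spill every DOCID for this HASH to a local temp file so the group
    // never has to fit in memory.
    File spill = File.createTempFile("docids", ".tmp");
    try (PrintWriter w = new PrintWriter(new FileWriter(spill))) {
      for (Text d : docIds) {
        w.println(d.toString());
      }
    }
    // Nested loop over the spill file: both passes are sequential reads.
    try (BufferedReader outer = new BufferedReader(new FileReader(spill))) {
      String a;
      while ((a = outer.readLine()) != null) {
        try (BufferedReader inner = new BufferedReader(new FileReader(spill))) {
          String b;
          while ((b = inner.readLine()) != null) {
            if (a.compareTo(b) < 0) {              // emit each pair once, ordered
              ctx.write(new Text(a + "\t" + b), ONE);
            }
          }
        }
      }
    }
    spill.delete();
  }
}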
john

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Sunday, June 16, 2013 1:25 PM
To: user@hadoop.apache.org
Subject: RE: How to design the mapper and reducer for the following problem

You basically have a "record similarity scoring and linking" problem -- common in data-quality software like ours.  This could be thought of as computing the cross-product of all records, counting the number of hash keys in common, and then outputting the pairs that exceed a threshold.  This is very slow for large data because the intermediate data set (or at least the number of iterations) grows as N squared.

If you have assurance that the frequency of a given HASH value is low, such that all instances of records containing a given hash key can fit into memory, it can be done as follows:

1) Mapper1 outputs one tuple per hash, with the hash as the key: {HASH1, DOCID}, {HASH2, DOCID}, {HASH3, DOCID}, {HASH4, DOCID} for each input record.

2) Reducer1 loads all tuples with the same HASH into memory.

3) Reducer1 outputs all tuples {DOCID1, DOCID2, HASH} that share the hash key (a nested loop, emitting only pairs where DOCID1 < DOCID2).

4) Mapper2 loads the tuples from Reducer1 and treats {DOCID1, DOCID2} as the key.

5) Reducer2 counts the instances of each {DOCID1, DOCID2} and outputs the DOCID pairs whose count exceeds the threshold.  (A minimal Java sketch of both jobs follows this list.)
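To make these steps concrete, here is a minimal sketch of the two jobs in Java, under the low-hash-frequency assumption. Class names, the tab-separated intermediate format, and the THRESHOLD constant are illustrative assumptions; Reducer1 emits a count of 1 per pair rather than the HASH itself, since job 2 only needs to count occurrences. Driver setup (job chaining, input/output paths) is omitted.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SharedHashPairs {

  // Job 1, Mapper1: emit one (HASH, DOCID) pair per fingerprint in "DOCID H1 H2 ...".
  public static class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      for (int i = 1; i < f.length; i++) {
        ctx.write(new Text(f[i]), new Text(f[0]));
      }
    }
  }

  // Job 1, Reducer1: nested loop over all DOCIDs sharing one HASH, emit ordered pairs.
  public static class Reducer1 extends Reducer<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void reduce(Text hash, Iterable<Text> docs, Context ctx)
        throws IOException, InterruptedException {
      List<String> ids = new ArrayList<>();        // relies on the low-frequency assumption
      for (Text d : docs) {
        ids.add(d.toString());
      }
      for (int i = 0; i < ids.size(); i++) {
        for (int j = i + 1; j < ids.size(); j++) {
          String a = ids.get(i), b = ids.get(j);
          if (a.equals(b)) continue;               // same document listed twice under this hash
          ctx.write(new Text(a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a), ONE);
        }
      }
    }
  }

  // Job 2, Mapper2: re-key job 1's text output ("DOCID1<tab>DOCID2<tab>1") by the DOCID pair.
  public static class Mapper2 extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      ctx.write(new Text(f[0] + "\t" + f[1]), ONE);
    }
  }

  // Job 2, Reducer2: sum the counts per pair and apply the threshold.
  public static class Reducer2 extends Reducer<Text, IntWritable, Text, NullWritable> {
    private static final int THRESHOLD = 3;        // placeholder threshold
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int n = 0;
      for (IntWritable c : counts) {
        n += c.get();
      }
      if (n >= THRESHOLD) {
        ctx.write(pair, NullWritable.get());
      }
    }
  }
}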

If you have no such assurance, make the first pass a map-only job (just Mapper1), and replace Reducer1 with a separate job that joins by HASH.  Joins are not standardized in MR but can be done with MultipleInputs, and of course Pig has this built in.  Searching on "Hadoop join" will give you some ideas of how to implement one in straight MR.

John


From: parnab kumar [mailto:parnab.2007@gmail.com]
Sent: Friday, June 14, 2013 8:06 AM
To: user@hadoop.apache.org
Subject: How to design the mapper and reducer for the following problem

An input file where each line corresponds to a document. Each document is identified by a set of fingerprints (hashes). For example, a line in the input file is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The output of the MapReduce job should be the pairs of DOCIDs that share at least a threshold number of hashes in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5


Re: How to design the mapper and reducer for the following problem

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Hi

My quick and dirty non-optimized solution would be as follows

MAPPER
=======
OUTPUT from Mapper
    <Key = Sorted List {HASH1,HASH2,HASH3,HASH4} >      <Value = DOCID1~HASH1 HASH2 HASH3 HASH4>
    <Key = Sorted List {HASH1,HASH3,HASH4,HASH5} >      <Value = DOCID2~HASH5 HASH3 HASH1 HASH4>

REDUCER
========
Iterate over the keys
For a key = (say) {HASH1,HASH2,HASH3,HASH4}
     Collect the DOCIDs from the values into a StringBuilder kind of class and emit the DOCID pair (a rough sketch follows the Output example below)

Output
KEY = {DOCID1 DOCID2}  value = null
KEY = {DOCID3 DOCID5} value = null
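A rough Java sketch of the above (illustrative class names; whitespace-separated input "DOCID H1 H2 ..."; the threshold is not applied). Note that keying on the full sorted hash list only groups documents whose fingerprint sets are exactly identical, which matches the quick-and-dirty intent.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ExactFingerprintMatch {

  // Key each document by its sorted list of hashes.
  public static class SortedHashMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      String[] hashes = Arrays.copyOfRange(f, 1, f.length);
      Arrays.sort(hashes);
      ctx.write(new Text(String.join(" ", hashes)), new Text(f[0]));
    }
  }

  // All DOCIDs that arrive under the same sorted hash list have identical fingerprints.
  public static class PairReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text sortedHashes, Iterable<Text> docIds, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text d : docIds) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(d.toString());
      }
      if (sb.indexOf(" ") >= 0) {                  // at least two matching documents
        ctx.write(new Text(sb.toString()), NullWritable.get());
      }
    }
  }
}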

Hope I have understood your problem correctly… if not, sorry about that.

sanjay

From: parnab kumar <pa...@gmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Friday, June 14, 2013 7:06 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: How to design the mapper and reducer for the following problem

An input file where each line corresponds to a document. Each document is identified by a set of fingerprints (hashes). For example, a line in the input file is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The output of the MapReduce job should be the pairs of DOCIDs that share at least a threshold number of hashes in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5

Many Errors at the last step of copying files from _temporary to Output Directory

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
Hi

My environment is like this

INPUT FILES
==========
400 GZIP files, one from each server - average gzipped size 25 MB

REDUCER
=======
Uses MultipleOutputs (a rough sketch of this setup follows the SETTINGS section)

OUTPUT  (Snappy)
=======
/path/to/output/dir1
/path/to/output/dir2
/path/to/output/dir3
/path/to/output/dir4

Number of output directories = 1600
Number of output files = 17000

SETTINGS
=========
Maximum Number of Transfer Threads
dfs.datanode.max.xcievers, dfs.datanode.max.transfer.threads  = 16384
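
A rough sketch of the kind of reducer setup described above (illustrative class names, named output, and routing rule - not the actual job code): MultipleOutputs fans records out to many subdirectories via the baseOutputPath argument, and the driver enables Snappy-compressed text output.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FanOutReducer extends Reducer<Text, Text, Text, NullWritable> {
  private MultipleOutputs<Text, NullWritable> mos;

  // Driver-side configuration: register a named output and enable Snappy compression.
  public static void configure(Job job) {
    MultipleOutputs.addNamedOutput(job, "logs", TextOutputFormat.class,
        Text.class, NullWritable.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
  }

  @Override
  protected void setup(Context ctx) {
    mos = new MultipleOutputs<>(ctx);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    for (Text v : values) {
      // The baseOutputPath ("dirN/part") picks the subdirectory; many distinct
      // prefixes is what produces hundreds of directories and thousands of files.
      mos.write("logs", v, NullWritable.get(), pickSubdir(key) + "/part");
    }
  }

  private String pickSubdir(Text key) {            // illustrative routing rule
    return "dir" + ((key.hashCode() & Integer.MAX_VALUE) % 1600);
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    mos.close();
  }
}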

ERRORS
=======
I am consistently getting errors at the last step of copying files from _temporary to the output directory.

ERROR 1
=======
BADF: Bad file descriptor
at org.apache.hadoop.io.nativeio.NativeIO.posix_fadvise(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO.posixFadviseIfPossible(NativeIO.java:145)
at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:205)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


ERROR 2
=======
2013-06-13 23:35:15,902 WARN [main] org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.28.21.171:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/10.28.21.171:57436, remote=/10.28.21.171:50010, for file /user/nextag/oozie-workflows/config/aggregations.conf, for pool BP-64441488-10.28.21.167-1364511907893 block 213045727251858949_8466884
java.io.IOException: Got error for OP_READ_BLOCK, self=/10.28.21.171:57436, remote=/10.28.21.171:50010, for file /user/nextag/oozie-workflows/config/aggregations.conf, for pool BP-64441488-10.28.21.167-1364511907893 block 213045727251858949_8466884
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:444)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:409)
at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:105)
at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:937)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:455)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:645)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:689)
at java.io.DataInputStream.read(DataInputStream.java:132)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at com.wizecommerce.utils.mapred.HdfsUtils.readFileIntoList(HdfsUtils.java:83)
at com.wizecommerce.utils.mapred.HdfsUtils.getConfigParamMap(HdfsUtils.java:214)
at com.wizecommerce.utils.mapred.NextagFileOutputFormat.getOutputPath(NextagFileOutputFormat.java:171)
at com.wizecommerce.utils.mapred.NextagFileOutputFormat.getOutputCommitter(NextagFileOutputFormat.java:330)
at com.wizecommerce.utils.mapred.NextagFileOutputFormat.getDefaultWorkFile(NextagFileOutputFormat.java:306)
at com.wizecommerce.utils.mapred.NextagTextOutputFormat.getRecordWriter(NextagTextOutputFormat.java:111)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:413)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:395)
at com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer.writePtitleExplanationBlob(OutpdirImpressionLogReducer.java:337)
at com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer.processPTitle(OutpdirImpressionLogReducer.java:171)
at com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer.reduce(OutpdirImpressionLogReducer.java:91)
at com.wizecommerce.parser.mapred.OutpdirImpressionLogReducer.reduce(OutpdirImpressionLogReducer.java:24)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:636)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:396)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)


Thanks
Sanjay

