Posted to mapreduce-user@hadoop.apache.org by Leonidas Fegaras <le...@hotmail.com> on 2011/02/10 20:21:22 UTC

Self-joins with

Hi,

I am trying to do a self-join on a file using MultipleInputs on Hadoop 0.21.0. A self-join is when you join a file with itself (for example, if you want to dereference the idrefs in an XML document). I use the following code:

MultipleInputs.addInputPath(job, new Path(file1), TextInputFormat.class, JoinMapperLeft.class);
MultipleInputs.addInputPath(job, new Path(file2), TextInputFormat.class, JoinMapperRight.class);

It works fine for two different files file1 and file2. It also works if I copy file1 to file2 or if I create a symbolic link from file2 to file1. It does not work if the file1 path is exactly the same as the file2 path (it parses file2 completely and applies the map function to the file2 content, but it thinks that file1 is empty). Is this a bug or is it done intentionally? Is there any way to do a self-join other than copying the file or creating a symbolic link?

Thank you,
Leonidas Fegaras

RE: Self-joins with

Posted by "Ghigliotti, Matthew" <Ma...@garmin.com>.
I'm inclined to say that this is (another) bug in MultipleInputs. Looking at the source code, your headache comes from the getMapperTypeMap() method, which builds a map from input paths to mapper classes. Because that map is keyed by path, reusing the same path for two mappers means the later entry overwrites the earlier one. That matches what you see: only JoinMapperRight gets any input, and the JoinMapperLeft side looks empty.
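
To make the overwrite concrete, here is a minimal plain-Java sketch of a path-to-mapper table. It does not call Hadoop. The class name MapperTableSketch, the helper names, the example path, and the "path;mapper" encoding with a ';' separator are all assumptions for illustration, not a copy of the MultipleInputs source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of a path -> mapper table; hypothetical, not Hadoop code.
class MapperTableSketch {

    // Append one "path;mapper" pair to a comma-separated string (assumed encoding).
    static String register(String existing, String path, String mapper) {
        String pair = path + ";" + mapper;
        return existing == null ? pair : existing + "," + pair;
    }

    // Rebuild the table from the encoded string, keyed by path.
    static Map<String, String> mapperTable(String encoded) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String pair : encoded.split(",")) {
            String[] parts = pair.split(";");
            // Same key twice: the later mapper replaces the earlier one.
            table.put(parts[0], parts[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        String conf = register(null, "/data/doc.xml", "JoinMapperLeft");
        conf = register(conf, "/data/doc.xml", "JoinMapperRight");
        // Prints {/data/doc.xml=JoinMapperRight}
        System.out.println(mapperTable(conf));
    }
}
```

With two distinct path strings (a copy or a symbolic link), the keys differ and both mappers survive, which is consistent with the workarounds described in the question.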

A while back, I pointed out a different bug in MultipleInputs, where you could not use input paths which utilized globs with commas (such as "/data/{January,February,March}.txt"). Since commas are used as delimiters to separate (Path, Mapper) and (Path, InputFormat) pairs within the Configuration object, such paths-with-comma-globs explode horribly.
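
The comma problem can be shown the same way. This is a rough plain-Java sketch, again without Hadoop; the class name CommaGlobSketch, the mapper name MyMapper, and the ';' between path and mapper are assumptions for illustration. Only the comma delimiter comes from the description above.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of comma-delimited pair parsing; hypothetical, not Hadoop code.
class CommaGlobSketch {

    // Split the encoded string on commas, as a comma-delimited format would.
    static List<String> splitPairs(String encoded) {
        return Arrays.asList(encoded.split(","));
    }

    public static void main(String[] args) {
        // One intended (path, mapper) pair whose path is a glob containing commas.
        String encoded = "/data/{January,February,March}.txt;MyMapper";
        for (String fragment : splitPairs(encoded)) {
            System.out.println(fragment);
        }
        // Prints three fragments instead of one pair:
        // /data/{January
        // February
        // March}.txt;MyMapper
    }
}
```

None of the three fragments is a valid pair for the intended glob, so any parser that expects one path and one class per fragment fails on them.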

*Matthew Ghigliotti*
