Posted to dev@pig.apache.org by "Zhang, Liyun" <li...@intel.com> on 2015/04/17 09:11:43 UTC
A question about TestSkewedJoin#testSkewedJoinKeyPartition
Hi all:
I have a question about TestSkewedJoin#testSkewedJoinKeyPartition:
@Test
public void testSkewedJoinKeyPartition() throws IOException {
    String outputDir = "testSkewedJoinKeyPartition";
    try{
        Util.deleteFile(cluster, outputDir);
    }catch(Exception e){
        // it is ok if directory not exist
    }
    pigServer.registerQuery("A = LOAD '" + INPUT_FILE1 + "' as (id, name, n);");
    pigServer.registerQuery("B = LOAD '" + INPUT_FILE2 + "' as (id, name);");
    pigServer.registerQuery("E = join A by id, B by id using 'skewed' parallel 7;");
    pigServer.store("E", outputDir);
    int[][] lineCount = new int[3][7];
    FileStatus[] outputFiles = fs.listStatus(new Path(outputDir), Util.getSuccessMarkerPathFilter());
    // check how many times a key appear in each part- file
    for (int i=0; i<7; i++) {
        String filename = outputFiles[i].getPath().toString();
        Util.copyFromClusterToLocal(cluster, filename, OUTPUT_DIR + "/" + i);
        BufferedReader reader = new BufferedReader(new FileReader(OUTPUT_DIR + "/" + i));
        String line = null;
        while((line = reader.readLine()) != null) {
            String[] cols = line.split("\t");
            int key = Integer.parseInt(cols[0])/100 -1;
            lineCount[key][i] ++;
        }
        reader.close();
    }
    int fc = 0;
    for(int i=0; i<3; i++) {
        for(int j=0; j<7; j++) {
            if (lineCount[i][j] > 0) {
                fc ++;
            }
        }
    }
    // atleast one key should be a skewed key
    // check atleast one key should appear in more than 1 part- file
    assertTrue(fc > 3);
}
When I run this unit test, the results are in OUTPUT_DIR/0 through OUTPUT_DIR/6 (because the parallelism is 7), and one key appears in more than one part- file.
But when I run the script below from the command line, the results are only in OUTPUT_DIR/part-00002, OUTPUT_DIR/part-00004 and OUTPUT_DIR/part-00006. The other part-0000x files are empty, and each key appears in only one part- file.
A = LOAD './SkewedJoinInput1.txt' as (id, name, n);
B = LOAD './SkewedJoinInput2.txt' as (id, name);
E = join A by id, B by id using 'skewed' parallel 7;
store E into './testSkewedJoin.out';
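To compare the two runs the same way, the per-key check from the test can be applied to the command-line output. Below is a minimal sketch, not part of the Pig test suite. It assumes the output directory is on the local file system (or has been copied there), that the files are named part-*, and that the join key is the first tab-separated column. The class name PartFileKeyCount and the method countPartFilesPerKey are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig: reports in how many part- files each key appears.
public class PartFileKeyCount {

    // Assumption: every non-empty line is tab-separated and its first column is the join key.
    static Map<String, Integer> countPartFilesPerKey(Path outputDir) throws IOException {
        Map<String, Set<String>> filesPerKey = new TreeMap<>();
        // The glob "part-*" skips hidden files such as .crc checksums and _SUCCESS markers.
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    if (line.isEmpty()) {
                        continue;
                    }
                    String key = line.split("\t", 2)[0];
                    filesPerKey.computeIfAbsent(key, k -> new HashSet<>())
                               .add(part.getFileName().toString());
                }
            }
        }
        // Reduce each set of file names to its size.
        Map<String, Integer> counts = new TreeMap<>();
        filesPerKey.forEach((key, files) -> counts.put(key, files.size()));
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the store statement in the script above.
        Path dir = Paths.get(args.length > 0 ? args[0] : "./testSkewedJoin.out");
        countPartFilesPerKey(dir).forEach((key, n) -> System.out.println(key + "\t" + n));
    }
}
```

A key that was split across reducers would show a count greater than 1. In my command-line run every key would show 1, while the unit test expects at least one key with a higher count.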
I don't understand why the results differ between the unit test environment and running the script directly from the command line.
I would appreciate any suggestions.
Kelly Zhang/Zhang,Liyun
Best Regards