Posted to dev@pig.apache.org by "Zhang, Liyun" <li...@intel.com> on 2015/04/17 09:11:43 UTC

A question about TestSkewedJoin#testSkewedJoinKeyPartition

Hi all,
   I have a question about TestSkewedJoin#testSkewedJoinKeyPartition:

@Test
    public void testSkewedJoinKeyPartition() throws IOException {
        String outputDir = "testSkewedJoinKeyPartition";
        try{
             Util.deleteFile(cluster, outputDir);
        }catch(Exception e){
            // it is ok if directory not exist
        }

         pigServer.registerQuery("A = LOAD '" + INPUT_FILE1 + "' as (id, name, n);");
         pigServer.registerQuery("B = LOAD '" + INPUT_FILE2 + "' as (id, name);");
         pigServer.registerQuery("E = join A by id, B by id using 'skewed' parallel 7;");
         pigServer.store("E", outputDir);

         int[][] lineCount = new int[3][7];

         FileStatus[] outputFiles = fs.listStatus(new Path(outputDir), Util.getSuccessMarkerPathFilter());
         // check how many times a key appear in each part- file
         for (int i=0; i<7; i++) {
             String filename = outputFiles[i].getPath().toString();
             Util.copyFromClusterToLocal(cluster, filename, OUTPUT_DIR + "/" + i);
             BufferedReader reader = new BufferedReader(new FileReader(OUTPUT_DIR + "/" + i));
             String line = null;
             while((line = reader.readLine()) != null) {
                 String[] cols = line.split("\t");
                 int key = Integer.parseInt(cols[0])/100 -1;
                 lineCount[key][i] ++;
             }
             reader.close();
         }

         int fc = 0;
         for(int i=0; i<3; i++) {
             for(int j=0; j<7; j++) {
                 if (lineCount[i][j] > 0) {
                     fc ++;
                 }
             }
         }
         // atleast one key should be a skewed key
         // check atleast one key should appear in more than 1 part- file
         assertTrue(fc > 3);
    }


When I run this unit test, the results are in OUTPUT_DIR/0 through OUTPUT_DIR/6 (because the parallel number is 7), and one key appears in more than one part- file.

But when I run the script below from the command line, the results are only in OUTPUT_DIR/part-00002, OUTPUT_DIR/part-00004 and OUTPUT_DIR/part-00006. The other part-0000x files are empty, and each key appears in only one part- file.
A = LOAD './SkewedJoinInput1.txt' as (id, name, n);
B = LOAD './SkewedJoinInput2.txt' as (id, name);
E = join A by id, B by id using 'skewed' parallel 7;
store E into './testSkewedJoin.out';
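
To compare the two runs on equal terms, the per-file key counting from the test can be applied to the command-line output as well. The following is only a minimal sketch, not code from the Pig test suite: the class name PartFileKeyCount and its methods are made up for illustration, it assumes the part- files have been copied to the local file system, and it assumes the ids fall in the range 100 to 399 as the test's id/100 - 1 mapping implies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig: counts how often each join key
// appears in each local part- file of a store directory.
class PartFileKeyCount {

    // Same mapping as the test: ids 1xx, 2xx, 3xx map to key index 0, 1, 2.
    static int keyIndex(String line) {
        String[] cols = line.split("\t");
        return Integer.parseInt(cols[0]) / 100 - 1;
    }

    // For every file matching part-* in outputDir, count lines per key index.
    // Assumes every id maps to an index in [0, numKeys).
    static Map<String, int[]> countKeys(Path outputDir, int numKeys) throws IOException {
        Map<String, int[]> counts = new TreeMap<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : parts) {
                int[] perKey = new int[numKeys];
                try (BufferedReader reader =
                         Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (line.isEmpty()) {
                            continue; // skip blank lines
                        }
                        perKey[keyIndex(line)]++;
                    }
                }
                counts.put(part.getFileName().toString(), perKey);
            }
        }
        return counts;
    }

    // Number of (key, part file) pairs with at least one line; "fc" in the test.
    static int nonEmptyCells(Map<String, int[]> counts) {
        int fc = 0;
        for (int[] perKey : counts.values()) {
            for (int n : perKey) {
                if (n > 0) {
                    fc++;
                }
            }
        }
        return fc;
    }

    public static void main(String[] args) throws IOException {
        // Default path matches the store location used in the script above.
        Path dir = Paths.get(args.length > 0 ? args[0] : "./testSkewedJoin.out");
        Map<String, int[]> counts = countKeys(dir, 3);
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
        System.out.println("fc = " + nonEmptyCells(counts));
    }
}
```

If fc is greater than 3, at least one key was split across several part- files, which is what the test asserts. If it is exactly 3, each key went to a single reducer.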


I don't understand why the results differ between running in the unit test environment and running the script directly from the command line.

I would appreciate any suggestions.




Kelly Zhang/Zhang,Liyun
Best Regards