Posted to dev@sqoop.apache.org by "Lan Jiang (JIRA)" <ji...@apache.org> on 2016/04/11 02:45:25 UTC

[jira] [Created] (SQOOP-2904) Oraoop does not distribute data evenly in mappers

Lan Jiang created SQOOP-2904:
--------------------------------

             Summary: Oraoop does not distribute data evenly in mappers
                 Key: SQOOP-2904
                 URL: https://issues.apache.org/jira/browse/SQOOP-2904
             Project: Sqoop
          Issue Type: Bug
          Components: connectors/oracle
    Affects Versions: 1.4.6
         Environment: RedHat 6.7 
            Reporter: Lan Jiang


When running the sqoop command below with the --direct option to import data from Oracle:

sqoop import -Doracle.row.fetch.size=20000 -Doraoop.timestamp.string=false --connect jdbc:oracle:thin:@xxx.xx.xx.xxxx -m 50 --direct --username xxx --password xxxx --table my_table_name --fetch-size 20000 --target-dir /data/temp --null-string '\\N' --null-non-string '\\N'

The stdout output contains:

16/04/08 10:39:06 INFO oracle.OraOopDataDrivenDBInputFormat: The table being imported by sqoop has 138310664 blocks that have been divided into 101 chunks which will be processed in 50 splits. The chunks will be allocated to the splits using the method : ROUNDROBIN
16/04/08 10:39:07 INFO mapreduce.JobSubmitter: number of splits:50

Thus 49 mappers each work on 2 chunks while 1 mapper works on 3 chunks. Because that single mapper handles 50% more data than each of the other mappers, it takes about 50% longer to finish.
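
The arithmetic behind that imbalance can be checked with a small sketch. This is not Sqoop code: the class and method names are hypothetical, and it assumes that ROUNDROBIN simply deals chunk i to split (i mod number of splits) and that all chunks are roughly the same size.

```java
import java.util.Arrays;

public class RoundRobinSketch {

    // Hypothetical helper (not part of Sqoop): deal chunks to splits
    // round-robin and return how many chunks each split receives.
    static int[] chunksPerSplit(int numChunks, int numSplits) {
        int[] counts = new int[numSplits];
        for (int chunk = 0; chunk < numChunks; chunk++) {
            counts[chunk % numSplits]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Values taken from the log line above: 101 chunks, 50 splits.
        int[] counts = chunksPerSplit(101, 50);
        int min = Arrays.stream(counts).min().getAsInt();
        int max = Arrays.stream(counts).max().getAsInt();
        long heavy = Arrays.stream(counts).filter(c -> c == max).count();
        System.out.println("min=" + min + " max=" + max
            + " splitsWithMax=" + heavy);
    }
}
```

Under those assumptions, exactly one split ends up with 3 chunks and the other 49 with 2, which matches the behavior described above.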

First of all, OraoopUtilities.java has a method getNumberOfDataChunksPerOracleDataFile:

  public static int getNumberOfDataChunksPerOracleDataFile(
      int desiredNumberOfMappers, org.apache.hadoop.conf.Configuration conf) {
    final String MAPPER_MULTIPLIER = "oraoop.datachunk.mapper.multiplier";
    final String RESULT_INCREMENT = "oraoop.datachunk.result.increment";

    int numberToMultiplyMappersBy = conf.getInt(MAPPER_MULTIPLIER, 2);
    int numberToIncrementResultBy = conf.getInt(RESULT_INCREMENT, 1);

    // The number of chunks generated will *not* be a multiple of the number of
    // splits,
    // to ensure that each split doesn't always get data from the start of each
    // data-file...
    int numberOfDataChunksPerOracleDataFile =
        (desiredNumberOfMappers * numberToMultiplyMappersBy)
            + numberToIncrementResultBy;

So it looks like it was designed this way on purpose, so that each split will not always get data from the start of each data file.

I thought I could simply set the property oraoop.datachunk.result.increment=0 to solve the issue, but testing showed that it does not change the behavior. I then dug deeper and found that this method is not actually called anywhere in Sqoop. Instead, the getSplits method of class OraOopDataDrivenDBInputFormat implements similar logic again, this time using hard-coded values:

    int desiredNumberOfMappers = getDesiredNumberOfMappers(jobContext);
    ...

      // The number of chunks generated will *not* be a multiple of the number
      // of splits,
      // to ensure that each split doesn't always get data from the start of
      // each data-file...
      int numberOfChunksPerOracleDataFile = (desiredNumberOfMappers * 2) + 1;

Thus there is no way to change this behavior other than fixing the code.  

The proposed fixes are:

1. Because the number of chunks is 2 * number of mappers + 1, data is distributed unevenly across mappers, prolonging the whole Sqoop job by about 50%. IMHO, the benefit gained by ensuring that each split doesn't always get data from the start of each data-file is insignificant compared to the drawback of uneven data distribution, so the default should produce a chunk count that divides evenly among the mappers.
2. The getSplits method in OraOopDataDrivenDBInputFormat.java should call the OraoopUtilities method getNumberOfDataChunksPerOracleDataFile, so that this behavior can be controlled through the oraoop.datachunk.mapper.multiplier and oraoop.datachunk.result.increment options.
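
To illustrate why making the increment configurable would help, here is a minimal sketch of the same formula with the multiplier and increment passed in as plain integers. It is not the actual Sqoop patch: the class and method names are hypothetical, the Hadoop Configuration lookup is replaced by method parameters to keep the example self-contained, and it assumes the chunk count is balanced whenever it is an exact multiple of the number of splits.

```java
public class ChunkCountSketch {

    // Hypothetical stand-in for getNumberOfDataChunksPerOracleDataFile,
    // taking the two option values directly instead of reading them
    // from a Hadoop Configuration.
    static int chunksPerDataFile(int desiredNumberOfMappers,
                                 int multiplier, int increment) {
        return (desiredNumberOfMappers * multiplier) + increment;
    }

    // Assumption: round-robin allocation is balanced only when the
    // chunk count divides evenly among the splits.
    static boolean isBalanced(int numChunks, int numSplits) {
        return numChunks % numSplits == 0;
    }

    public static void main(String[] args) {
        int mappers = 50;

        // Current hard-coded behavior: multiplier 2, increment 1.
        int current = chunksPerDataFile(mappers, 2, 1);
        System.out.println(current + " chunks, balanced="
            + isBalanced(current, mappers));

        // With oraoop.datachunk.result.increment set to 0.
        int proposed = chunksPerDataFile(mappers, 2, 0);
        System.out.println(proposed + " chunks, balanced="
            + isBalanced(proposed, mappers));
    }
}
```

With 50 mappers the first case yields 101 chunks (unbalanced) and the second yields 100 chunks (2 per mapper), which is the outcome the oraoop.datachunk.result.increment=0 setting was expected to produce.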



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)