You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@sqoop.apache.org by Sachneet Singh Bains <sa...@impetus.co.in> on 2015/04/01 10:29:42 UTC

Sqoop import: Issue in extracting primary key for 'split-by'

Hi,

We are using Sqoop (1.4.5)to import tables from Teradata to Hadoop.
Came across an issue that was breaking the import every time. We debugged and resolved it.
Sharing as it may be of some help.

The Scenario and failure:
(I have renamed table and db for explanation)

*         Importing a table named TABLE1 (Primary Key - PK1) from database DB1

*         Not specifying the split by field in the import command.(Sqoop in this case fetches the primary key and uses it as the split-by column)

*         The import fails every time saying PK2 does not exist in TABLE1 with the following error
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database] [TeraJDBC 14.10.00.26] [Error 5628] [SQLState HY000] Column PK2 not found in DB1.TABLE1.
        at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:170)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)

The Issue and solution:
Issue: The  issue is that in the same Teradata DBMS there exists a table by the same name TABLE1 (in different database DB2) with a different primary key -PK2.
Sqoop is using the primary key of this table instead of the correct one. Below code is the problem:
org.apache.sqoop.manager.SqlManager.java (line number 559)
getPrimaryKeys(null, null, tableName) returns a result set of pk's for all the tables in the DBMS and the code uses the first result without confirming the database it belongs to.
In our case the first result is from DB2 though DB1.TABLE1 was expected.
public String getPrimaryKey(String tableName) {
    try {
      DatabaseMetaData metaData = this.getConnection().getMetaData();
      ResultSet results = metaData.getPrimaryKeys(null, null, tableName);
      if (null == results) {
        return null;
      }

      try {
        if (results.next()) {
          return results.getString("COLUMN_NAME");
        } else {
          return null;
        }
Solution:

1.       If we use 'Split-by' explicitly in the import command this code won't be executed and thus no issues.

2.       Using single mapper also avoids the issue as no splits are required in that case.

3.       Modify the code above to use the database name as well while fetching primary key. Parse the database name from the URL and use it while getting the primary key:

ResultSet results = metaData.getPrimaryKeys(null, dbName, tableName);

Thanks,
Sachneet









________________________________






NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.

Re: Sqoop import: Issue in extracting primary key for 'split-by'

Posted by Abraham Elmahrek <ab...@cloudera.com>.

Thanks for debugging. Could you create a Jira upstream for this? Also,
there are connectors specific for teradata that you can use. Some of the
larger Hadoop vendors out there provide their own teradata connectors.

On Wed, Apr 1, 2015 at 1:29 AM, Sachneet Singh Bains <
sachneets.bains@impetus.co.in> wrote:

> Hi,
>
> We are using Sqoop (1.4.5)to import tables from Teradata to Hadoop.
> Came across an issue that was breaking the import every time. We debugged
> and resolved it.
> Sharing as it may be of some help.
>
> The Scenario and failure:
> (I have renamed table and db for explanation)
>
> *         Importing a table named TABLE1 (Primary Key - PK1) from database
> DB1
>
> *         Not specifying the split by field in the import command.(Sqoop
> in this case fetches the primary key and uses it as the split-by column)
>
> *         The import fails every time saying PK2 does not exist in TABLE1
> with the following error
> ERROR tool.ImportTool: Encountered IOException running import job:
> java.io.IOException: com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata
> Database] [TeraJDBC 14.10.00.26] [Error 5628] [SQLState HY000] Column PK2
> not found in DB1.TABLE1.
>         at
> org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:170)
>         at
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
>         at
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
>         at
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
>
> The Issue and solution:
> Issue: The  issue is that in the same Teradata DBMS there exists a table
> by the same name TABLE1 (in different database DB2) with a different
> primary key -PK2.
> Sqoop is using the primary key of this table instead of the correct one.
> Below code is the problem:
> org.apache.sqoop.manager.SqlManager.java (line number 559)
> getPrimaryKeys(null, null, tableName) returns a result set of pk's for all
> the tables in the DBMS and the code uses the first result without
> confirming the database it belongs to.
> In our case the first result is from DB2 though DB1.TABLE1 was expected.
> public String getPrimaryKey(String tableName) {
>     try {
>       DatabaseMetaData metaData = this.getConnection().getMetaData();
>       ResultSet results = metaData.getPrimaryKeys(null, null, tableName);
>       if (null == results) {
>         return null;
>       }
>
>       try {
>         if (results.next()) {
>           return results.getString("COLUMN_NAME");
>         } else {
>           return null;
>         }
> Solution:
>
> 1.       If we use 'Split-by' explicitly in the import command this code
> won't be executed and thus no issues.
>
> 2.       Using single mapper also avoids the issue as no splits are
> required in that case.
>
> 3.       Modify the code above to use the database name as well while
> fetching primary key. Parse the database name from the URL and use it while
> getting the primary key:
>
> ResultSet results = metaData.getPrimaryKeys(null, dbName, tableName);
>
> Thanks,
> Sachneet
>
>
>
>
>
>
>
>
>
> ________________________________
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>

Re: Sqoop import: Issue in extracting primary key for 'split-by'

Posted by Abraham Elmahrek <ab...@cloudera.com>.

Thanks for debugging. Could you create a Jira upstream for this? Also,
there are connectors specific for teradata that you can use. Some of the
larger Hadoop vendors out there provide their own teradata connectors.

On Wed, Apr 1, 2015 at 1:29 AM, Sachneet Singh Bains <
sachneets.bains@impetus.co.in> wrote:

> Hi,
>
> We are using Sqoop (1.4.5)to import tables from Teradata to Hadoop.
> Came across an issue that was breaking the import every time. We debugged
> and resolved it.
> Sharing as it may be of some help.
>
> The Scenario and failure:
> (I have renamed table and db for explanation)
>
> *         Importing a table named TABLE1 (Primary Key - PK1) from database
> DB1
>
> *         Not specifying the split by field in the import command.(Sqoop
> in this case fetches the primary key and uses it as the split-by column)
>
> *         The import fails every time saying PK2 does not exist in TABLE1
> with the following error
> ERROR tool.ImportTool: Encountered IOException running import job:
> java.io.IOException: com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata
> Database] [TeraJDBC 14.10.00.26] [Error 5628] [SQLState HY000] Column PK2
> not found in DB1.TABLE1.
>         at
> org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:170)
>         at
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
>         at
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
>         at
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
>
> The Issue and solution:
> Issue: The  issue is that in the same Teradata DBMS there exists a table
> by the same name TABLE1 (in different database DB2) with a different
> primary key -PK2.
> Sqoop is using the primary key of this table instead of the correct one.
> Below code is the problem:
> org.apache.sqoop.manager.SqlManager.java (line number 559)
> getPrimaryKeys(null, null, tableName) returns a result set of pk's for all
> the tables in the DBMS and the code uses the first result without
> confirming the database it belongs to.
> In our case the first result is from DB2 though DB1.TABLE1 was expected.
> public String getPrimaryKey(String tableName) {
>     try {
>       DatabaseMetaData metaData = this.getConnection().getMetaData();
>       ResultSet results = metaData.getPrimaryKeys(null, null, tableName);
>       if (null == results) {
>         return null;
>       }
>
>       try {
>         if (results.next()) {
>           return results.getString("COLUMN_NAME");
>         } else {
>           return null;
>         }
> Solution:
>
> 1.       If we use 'Split-by' explicitly in the import command this code
> won't be executed and thus no issues.
>
> 2.       Using single mapper also avoids the issue as no splits are
> required in that case.
>
> 3.       Modify the code above to use the database name as well while
> fetching primary key. Parse the database name from the URL and use it while
> getting the primary key:
>
> ResultSet results = metaData.getPrimaryKeys(null, dbName, tableName);
>
> Thanks,
> Sachneet
>
>
>
>
>
>
>
>
>
> ________________________________
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>