Posted to common-dev@hadoop.apache.org by "Aaron Kimball (JIRA)" <ji...@apache.org> on 2009/05/12 19:55:45 UTC

[jira] Created: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Sqoop: A database import tool for Hadoop
----------------------------------------

                 Key: HADOOP-5815
                 URL: https://issues.apache.org/jira/browse/HADOOP-5815
             Project: Hadoop Core
          Issue Type: New Feature
            Reporter: Aaron Kimball
            Assignee: Aaron Kimball
         Attachments: HADOOP-5815.patch


Overview:

Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema for tables, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read the table from the database via the DBInputFormat (JDBC-based InputFormat). The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported.

Longer term, Sqoop will support automatic connectivity to Hive, with the ability to load data files directly into the Hive warehouse directory, and also to inject the appropriate table definition into the metastore.

Some more specifics:

Sqoop is a program implemented as a contrib module. Its frontend is invoked through "bin/hadoop jar sqoop.jar ..." and allows you to connect to arbitrary JDBC databases and extract their tables into files in HDFS. The underlying implementation utilizes the JDBC interface of HADOOP-2536 (DBInputFormat). The DBWritable implementation needed to extract a table is generated by this tool, based on the types of the columns seen in the table. Sqoop uses JDBC to examine the table specification and translate this to the appropriate Java types.
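
For illustration only -- the generated source is not reproduced in this issue, so the table, column names, and exact layout below are assumptions -- a class generated for a simple two-column table might look roughly like this:

    // Hypothetical sketch of a record class Sqoop might generate for a table
    // with columns (employee_id INTEGER, first_name VARCHAR). The real generated
    // source is written out as a .java file by the tool and may differ.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class Employees implements Writable, DBWritable {
      private int employee_id;
      private String first_name;

      // DBWritable: populate fields from one row of a JDBC result set.
      public void readFields(ResultSet rs) throws SQLException {
        employee_id = rs.getInt(1);
        first_name = rs.getString(2);
      }

      // DBWritable: bind fields to a prepared statement (required by the
      // interface; not exercised by an import).
      public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, employee_id);
        stmt.setString(2, first_name);
      }

      // Writable: serialization used when records land in SequenceFiles.
      public void readFields(DataInput in) throws IOException {
        employee_id = in.readInt();
        first_name = in.readUTF();
      }

      public void write(DataOutput out) throws IOException {
        out.writeInt(employee_id);
        out.writeUTF(first_name);
      }

      // Text-mode imports can simply emit toString(), one record per line.
      public String toString() {
        return employee_id + "," + first_name;
      }
    }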

The generated classes are provided as .java files for the user to reuse. They are also compiled into a jar and used to run a MapReduce task that performs the data import. This results in either text files or SequenceFiles in HDFS. In the latter case, these Java classes are embedded in the SequenceFiles as well.
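
As a rough sketch of how such a job could be wired together on top of the HADOOP-2536 classes, reusing the Employees class sketched above (this is not Sqoop's actual job setup, which is in the attached patch; paths and class names are illustrative):

    // Hypothetical DBInputFormat-based import job using the old mapred API.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;

    public class ImportJob {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ImportJob.class);

        // Point the job at the database (driver class and connect string).
        DBConfiguration.configureDB(job,
            "com.mysql.jdbc.Driver", "jdbc:mysql://db.example.com/company");

        // Read the employees table, ordered/split on the employee_id column.
        job.setInputFormat(DBInputFormat.class);
        DBInputFormat.setInput(job, Employees.class, "employees",
            null /* conditions */, "employee_id" /* order by */,
            "employee_id", "first_name");

        // Map-only job: pass records straight through to the output format.
        job.setMapperClass(IdentityMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputFormat(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Employees.class);
        FileOutputFormat.setOutputPath(job,
            new Path("/shared/imported_databases/employees"));

        JobClient.runJob(job);
      }
    }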

The program will extract a specific table from a database or, optionally, all tables. For a table, it can read all columns or just a subset. Since HADOOP-2536 requires that a sorting key be specified for the import task, Sqoop will auto-detect the presence of a primary key on a table and automatically use it as the sort order; the user can also manually specify a sorting column.
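
Primary-key detection of this sort can be done through standard JDBC metadata; a minimal sketch (not necessarily the exact code in the patch) might be:

    // Minimal sketch of primary-key auto-detection via JDBC metadata.
    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class PrimaryKeyProbe {
      /** Returns a primary-key column of the table, or null if it has none. */
      public static String getPrimaryKeyColumn(Connection conn, String table)
          throws SQLException {
        DatabaseMetaData meta = conn.getMetaData();
        ResultSet rs = meta.getPrimaryKeys(null, null, table);
        try {
          return rs.next() ? rs.getString("COLUMN_NAME") : null;
        } finally {
          rs.close();
        }
      }
    }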

Example invocations:

To import an entire database:

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --all-tables

(Requires that all tables have primary keys)

To select a single table:

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees

To select a subset of columns from a table:

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --columns "employee_id,first_name,last_name,salary,start_date"

To explicitly set the sort column, import format, and import destination (the table will go to /shared/imported_databases/employees):

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --order-by employee_id --warehouse-dir /shared/imported_databases --as-sequencefile

Sqoop will automatically select the correct JDBC driver class name for HSQLDB and MySQL; this can also be set explicitly, e.g.:

hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:postgresql://db.example.com/company --driver org.postgresql.Driver --all-tables
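
Driver selection of this kind amounts to a lookup keyed on the connect-string scheme; an illustrative sketch (the driver class names are the standard ones, but this is not necessarily how the patch implements the lookup):

    // Illustrative mapping from connect string to JDBC driver class name.
    public class DriverSelector {
      public static String driverFor(String connectString) {
        if (connectString.startsWith("jdbc:mysql:")) {
          return "com.mysql.jdbc.Driver";   // MySQL Connector/J
        } else if (connectString.startsWith("jdbc:hsqldb:")) {
          return "org.hsqldb.jdbcDriver";   // HSQLDB
        }
        return null;  // unknown scheme: the user must supply --driver explicitly
      }
    }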


Testing has been conducted with HSQLDB and MySQL. A set of unit tests covers a great deal of Sqoop's functionality, and this tool has been used in practice at Cloudera and with a few other early test users on "real" databases.

A readme file included in the patch contains documentation on how to use the tool.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708779#action_12708779 ] 

Noble Paul edited comment on HADOOP-5815 at 5/12/09 11:18 PM:
--------------------------------------------------------------

There is a tool called DataImportHandler, used successfully in Solr, which imports data from RDBMSes, HTTP URLs, etc. If necessary, we can reuse large parts of it.

http://wiki.apache.org/solr/DataImportHandler

There is a plan to make it available as a library that can be used to import into any kind of document database (Solr, CouchDB, Hadoop, etc.).

      was (Author: noble.paul):
    There is a tool called DataImportHandler, used successfully in Solr, which imports data from RDBMSes, HTTP URLs, etc. If necessary, we can reuse large parts of it.

http://wiki.apache.org/solr/DataImportHandler
  


[jira] Updated: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated HADOOP-5815:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709306#action_12709306 ] 

Hadoop QA commented on HADOOP-5815:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12408037/HADOOP-5815.2.patch
  against trunk revision 774625.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 28 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/336/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/336/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/336/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/336/console

This message is automatically generated.



[jira] Commented: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709071#action_12709071 ] 

Aaron Kimball commented on HADOOP-5815:
---------------------------------------

Hi Noble,

I've read through your document there and the related JIRA item in Solr. I'm a bit confused as to how it is applicable here -- maybe you could explain further. As I understand it, the DataImportHandler is designed to ingest data from various sources in a manner that is user-configured on a per-table basis, and to incorporate that data into indices that are then readable from the rest of the Solr system. (Disclaimer: I have very little understanding of Solr's goals and features. As I understand it, it's a search-engine front-end.)

Sqoop's goal (already met by this implementation) is to do ad-hoc loading of database tables into HDFS by performing a straightforward translation of rows to text while physically moving the bits from the database into flat files in HDFS. HDFS does not naturally include any indexing or other higher-level structures over a data set. 

Can you please explain further where you see integration points between these two tools? Thanks!



[jira] Updated: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated HADOOP-5815:
----------------------------------

    Status: Open  (was: Patch Available)



[jira] Updated: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated HADOOP-5815:
----------------------------------

    Attachment: HADOOP-5815.patch

Attaching patch that contains sqoop; adds project to src/contrib/sqoop/



[jira] Commented: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708807#action_12708807 ] 

Hadoop QA commented on HADOOP-5815:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12407903/HADOOP-5815.patch
  against trunk revision 774138.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 28 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 release audit.  The applied patch generated 489 release audit warnings (more than the trunk's current 486 warnings).

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/332/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/332/artifact/trunk/current/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/332/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/332/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/332/console

This message is automatically generated.



[jira] Commented: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709299#action_12709299 ] 

Noble Paul commented on HADOOP-5815:
------------------------------------

DIH (DataImportHandler) is a small tool to extract data out of various structured data sources (RDBMS, XML, etc.) into flat documents. A document is nothing but a Map<String,Object>: the key is the field name and the value can be a single object or a list of objects.

DIH is about collecting data from various sources using a config script (say, mixing and matching data from an XML file and a DB) to produce a record.
The config is written in XML. The user can do custom operations on the extracted data using Java/JavaScript (or any scripting language supported by Java 6).

What does a record look like in Hadoop?



[jira] Commented: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709453#action_12709453 ] 

Aaron Kimball commented on HADOOP-5815:
---------------------------------------

Contrib test failures are unrelated (streaming).



[jira] Updated: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Tom White (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-5815:
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.21.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I've just committed this. Thanks Aaron!



[jira] Commented: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708779#action_12708779 ] 

Noble Paul commented on HADOOP-5815:
------------------------------------

There is a tool called DataImportHandler, used successfully in Solr, which imports data from RDBMSes, HTTP URLs, etc. If necessary, we can reuse large parts of it.

http://wiki.apache.org/solr/DataImportHandler




[jira] Updated: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated HADOOP-5815:
----------------------------------

    Status: Patch Available  (was: Open)




[jira] Updated: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated HADOOP-5815:
----------------------------------

    Attachment: HADOOP-5815.2.patch

New patch to fix releaseaudit warnings




[jira] Commented: (HADOOP-5815) Sqoop: A database import tool for Hadoop

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711186#action_12711186 ] 

Tom White commented on HADOOP-5815:
-----------------------------------

+1

This looks good to me. I'd like to commit this in the next day or two.

