Posted to dev@sqoop.apache.org by "Masatake Iwasaki (Created) (JIRA)" <ji...@apache.org> on 2011/11/04 09:43:00 UTC

[jira] [Created] (SQOOP-390) PostgreSQL connector for direct export with pg_bulkload

PostgreSQL connector for direct export with pg_bulkload
-------------------------------------------------------

                 Key: SQOOP-390
                 URL: https://issues.apache.org/jira/browse/SQOOP-390
             Project: Sqoop
          Issue Type: New Feature
            Reporter: Masatake Iwasaki


h1. Features

* Fast data export with [pg_bulkload|http://pgbulkload.projects.postgresql.org/index.html].
* Users benefit from pg_bulkload features such as
** bypassing shared buffers and WAL,
** ignoring data that causes parse errors,
** ETL with filter functions.


h1. Implementation

* Each map task creates its own staging table, named after its task attempt ID, then exports data by invoking pg_bulkload as an external process.
* Staging tables are dropped on
** successful completion of the export,
** or an exception in a map task.
* A reduce task migrates data from the staging tables into the destination table.
** The number of reduce tasks is internally set to 1.
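The per-task staging idea above can be sketched as follows. The naming function is a hypothetical illustration (the actual scheme is defined in the patch); it is shown only to make the "staging table per task attempt" approach concrete.

```python
# Hypothetical sketch: derive a per-task staging table name from a Hadoop
# task attempt ID so that concurrent map tasks never load into the same table.
# The real naming logic lives in the connector's mapper; this is illustrative.
def staging_table_name(dest_table, attempt_id):
    # Attempt IDs look like "attempt_201111040943_0001_m_000002_0";
    # keep only characters valid in a PostgreSQL identifier.
    suffix = attempt_id.replace("attempt_", "").replace("-", "_")
    return "%s_%s" % (dest_table, suffix)

name = staging_table_name("test", "attempt_201111040943_0001_m_000002_0")
```

Because each name embeds the task attempt ID, a retried attempt gets a fresh table and cannot collide with the failed attempt's leftovers.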


h1. Requirements

* pg_bulkload must be installed on the DB server and on all slave nodes.
* The PostgreSQL JDBC driver is also required on the client (as with PostgresqlManager).
* A superuser role on the PostgreSQL database is required to run pg_bulkload.


h1. Usage

Currently there is no factory class; specify the connection manager class name with the --connection-manager option.
{noformat} 
sqoop export --connect jdbc:postgresql://localhost:5432/test \
             --connection-manager com.cloudera.sqoop.manager.PGBulkloadManager \
             --table test --username postgres --export-dir=/test -m 1
{noformat} 

You can also specify pg_bulkload configuration with Hadoop configuration properties.

{noformat} 
-Dpgbulkload.bin="/usr/bin/pg_bulkload"
-Dpgbulkload.input.field.delim=$'\t'
-Dpgbulkload.check.constraints="YES"
-Dpgbulkload.parse.errors="INFINITE"
-Dpgbulkload.duplicate.errors="INFINITE"
{noformat} 


h1. Test

There is a test class named PGBulkloadManagerTest, which extends TestExport.

{noformat} 
 ant -Dtestcase=PGBulkloadManagerTest test
{noformat} 

This test requires
* PostgreSQL running on localhost:5432,
* a database named sqooptest,
* a superuser role named sqooptest with no password (or a .pgpass file created).


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (SQOOP-390) PostgreSQL connector for direct export with pg_bulkload

Posted by "Masatake Iwasaki (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SQOOP-390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Masatake Iwasaki updated SQOOP-390:
-----------------------------------

    Attachment: SQOOP-390-1.patch

I would like review of and suggestions for this PostgreSQL connector.
Thanks in advance.
                

[jira] [Commented] (SQOOP-390) PostgreSQL connector for direct export with pg_bulkload

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SQOOP-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143900#comment-13143900 ] 

jiraposter@reviews.apache.org commented on SQOOP-390:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2724/
-----------------------------------------------------------

Review request for Sqoop.


Summary
-------

Patch for SQOOP-390
https://issues.apache.org/jira/browse/SQOOP-390


This addresses bug SQOOP-390.
    https://issues.apache.org/jira/browse/SQOOP-390


Diffs
-----

  /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION 
  /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION 
  /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION 
  /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION 
  /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION 
  /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/2724/diff


Testing
-------


Thanks,

Masatake


                

[jira] [Updated] (SQOOP-390) PostgreSQL connector for direct export with pg_bulkload

Posted by "Masatake Iwasaki (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SQOOP-390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Masatake Iwasaki updated SQOOP-390:
-----------------------------------

    Description: 
h1. Features

* Fast data export into a PostgreSQL database with [pg_bulkload|http://pgbulkload.projects.postgresql.org/index.html].
* Users benefit from pg_bulkload features such as
** fast export bypassing shared buffers and WAL,
** removing invalid data that causes parse errors,
** ETL with filter functions.


h1. Implementation

h2. PGBulkloadExportMapper

First, each map task creates its own staging table, named after its task attempt ID, because pg_bulkload holds a table-level lock.
Arguments of the pg_bulkload command can be passed via the configuration object.
The mapper exports data by invoking the pg_bulkload command as an external process.
Command execution is done in the same way as in DirectPostgresqlManager.
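As a rough sketch of that external invocation, the helper below assembles an argument vector from configuration properties like those shown in the Usage section. The "-o KEY=VALUE" control-option style, the "db.name"/"db.user" keys, and the property-to-option mapping are assumptions for illustration, not the connector's actual code; consult the pg_bulkload documentation for the real command-line syntax.

```python
# Illustrative sketch of building the external pg_bulkload command line from
# configuration properties. Option names and conf keys are assumptions here.
def build_pg_bulkload_argv(conf, staging_table, input_path):
    argv = [conf.get("pgbulkload.bin", "pg_bulkload")]
    # "db.name" / "db.user" are hypothetical keys standing in for whatever
    # the connector derives from the JDBC URL and --username.
    argv += ["-d", conf["db.name"], "-U", conf["db.user"]]
    argv += ["-o", "TABLE=%s" % staging_table, "-o", "INPUT=%s" % input_path]
    # Forward the optional pgbulkload.* properties shown in the Usage section.
    for prop, key in [("pgbulkload.check.constraints", "CHECK_CONSTRAINTS"),
                      ("pgbulkload.parse.errors", "PARSE_ERRORS"),
                      ("pgbulkload.duplicate.errors", "DUPLICATE_ERRORS")]:
        if prop in conf:
            argv += ["-o", "%s=%s" % (key, conf[prop])]
    return argv

argv = build_pg_bulkload_argv(
    {"pgbulkload.bin": "/usr/bin/pg_bulkload", "db.name": "test",
     "db.user": "postgres", "pgbulkload.parse.errors": "INFINITE"},
    "test_attempt_0001", "/tmp/part-00000")
```

The resulting list would be handed to something like Java's ProcessBuilder; properties that were not set are simply left out of the command.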


h2. PGBulkloadExportReducer

The reducer migrates data from the staging tables into the destination table.
It receives the names of the staging tables as map output values.
In order to do the migration in a single transaction, the number of reducers must be 1 (it is set by the ConnectionManager internally).
Migration is done in the same way as in Sqoop's default connectors, using "INSERT INTO dst ( SELECT * FROM src )".
In the cleanup method, the staging tables are dropped.
If an exception is raised in the reducer, garbage staging tables are left behind.
Users can delete them by running this connector with the --clear-staging-table option.
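The migration step can be sketched as statement generation. Table names and the exact transaction handling are illustrative, but the INSERT form matches the one quoted above.

```python
# Minimal sketch of the reducer's migration: move rows from each staging
# table into the destination inside one transaction, then drop the staging
# tables. Statement text only; execution via JDBC is out of scope here.
def migration_statements(dest, staging_tables):
    stmts = ["BEGIN"]
    for src in staging_tables:
        # Same form the description cites for Sqoop's default connectors.
        stmts.append("INSERT INTO %s ( SELECT * FROM %s )" % (dest, src))
    stmts.append("COMMIT")
    # Cleanup corresponds to dropping staging tables in the cleanup method.
    stmts += ["DROP TABLE %s" % src for src in staging_tables]
    return stmts

stmts = migration_statements("test", ["test_m_000000_0", "test_m_000001_0"])
```

Running all INSERTs inside one transaction is why the reducer count must be 1: a single reducer sees every staging table name and can commit the whole migration atomically.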


h1. Requirements

* pg_bulkload must be installed on the DB server and on all slave nodes.
** An RPM for RedHat or CentOS is available on the [download page|http://pgfoundry.org/frs/?group_id=1000261].
* The [PostgreSQL JDBC driver|http://jdbc.postgresql.org/download.html] is also required on the client node (as with PostgresqlManager).
* A superuser role on the PostgreSQL database is required to run pg_bulkload.


h1. Usage

Currently there is no factory class.
Specify the connection manager class name with the --connection-manager option.
{noformat}
$ sqoop export --connect jdbc:postgresql://localhost:5432/test \
               --connection-manager com.cloudera.sqoop.manager.PGBulkloadManager \
               --table test --username postgres --export-dir=/test -m 1
{noformat}

You can also specify pg_bulkload configuration with Hadoop configuration properties.

{noformat}
$ sqoop export \
    -Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
    -Dpgbulkload.input.field.delim=$'\t' \
    -Dpgbulkload.check.constraints="YES" \
    -Dpgbulkload.parse.errors="INFINITE" \
    -Dpgbulkload.duplicate.errors="INFINITE" \
    --connect jdbc:postgresql://localhost:5432/test \
    --connection-manager com.cloudera.sqoop.manager.PGBulkloadManager \
    --table test --username postgres --export-dir=/test -m 1
{noformat}


h1. Test

There is a test class named PGBulkloadManagerTest, which extends TestExport.
This test expects that
* the major version of PostgreSQL is 9.0,
* the version of pg_bulkload is 3.1.1,
* a PostgreSQL server is running on localhost:5432 of the testing node,
* there is a database named sqooptest,
* a superuser role named sqooptest exists,
* no password is set for the role, or a .pgpass file is created.

Tests can be invoked as follows.
{noformat}
# ant -Dtestcase=PGBulkloadManagerTest test
{noformat}


h2. Test on single-node cluster built on CentOS 6.0

Hadoop is expected to be running in pseudo-distributed mode.
{noformat}
# rpm -ivh jdk-6u29-linux-amd64.rpm
# yum install ant-junit
# yum install hadoop-0.20-conf-pseudo
# sudo -u hdfs hadoop namenode -format
# /etc/init.d/hadoop-0.20-namenode start
# /etc/init.d/hadoop-0.20-datanode start
# /etc/init.d/hadoop-0.20-jobtracker start
# /etc/init.d/hadoop-0.20-tasktracker start
{noformat}

Setup for PostgreSQL:
{noformat}
# wget http://ftp.postgresql.org/pub/source/v9.0.7/postgresql-9.0.7.tar.bz2
# tar jxf postgresql-9.0.7.tar.bz2
# cd postgresql-9.0.7
# ./configure --prefix=/usr/local
# make
# make install
# useradd postgres
# mkdir /var/pgdata
# chown postgres:postgres /var/pgdata
# chmod 700 /var/pgdata
# sudo -u postgres /usr/local/bin/initdb  -D /var/pgdata -E utf-8
# sudo -u postgres /usr/local/bin/pg_ctl start -D /var/pgdata -l /var/pgdata/postgresql.log
# createuser -U postgres -s sqooptest
# createdb -U sqooptest sqooptest
{noformat}

Setup for pg_bulkload:
{noformat}
# cd ..
# wget http://pgfoundry.org/frs/download.php/3176/pg_bulkload-3.1.1.tar.gz
# tar zxf pg_bulkload-3.1.1.tar.gz
# mv pg_bulkload-3.1.1 postgresql-9.0.7/contrib/
# cd postgresql-9.0.7/contrib/pg_bulkload-3.1.1
# make
# make install
# psql -U sqooptest -f /usr/local/share/postgresql/contrib/pg_bulkload.sql sqooptest
# ln -s /usr/local/bin/pg_bulkload /usr/bin/
{noformat}

Setup for PostgreSQL JDBC Driver:
{noformat}
# wget http://jdbc.postgresql.org/download/postgresql-9.0-802.jdbc4.jar
# cp postgresql-9.0-802.jdbc4.jar /usr/lib/hadoop/lib/
{noformat}

Run the test:
{noformat}
# cd /usr/local/src/sqoop-trunk
# ant -Dtestcase=PGBulkloadManagerTest test
{noformat}


[jira] [Commented] (SQOOP-390) PostgreSQL connector for direct export with pg_bulkload

Posted by "Jarek Jarcec Cecho (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SQOOP-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143846#comment-13143846 ] 

Jarek Jarcec Cecho commented on SQOOP-390:
------------------------------------------

Hi Masatake,
thank you very much for your patch. We're currently using the Review Board application running at https://reviews.apache.org/ for patch reviews. Would you be so kind as to upload it there for review, please?

Thank you,
Jarcec
                

[jira] [Issue Comment Edited] (SQOOP-390) PostgreSQL connector for direct export with pg_bulkload

Posted by "Masatake Iwasaki (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SQOOP-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143838#comment-13143838 ] 

Masatake Iwasaki edited comment on SQOOP-390 at 11/4/11 8:59 AM:
-----------------------------------------------------------------

I would like review of and suggestions for this connector.
Thanks in advance.
                
      was (Author: iwasakims):
    I would like review and suggestion for this connector Postgresql.
Thanks in advance.
                  

[jira] [Commented] (SQOOP-390) PostgreSQL connector for direct export with pg_bulkload

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SQOOP-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185437#comment-13185437 ] 

jiraposter@reviews.apache.org commented on SQOOP-390:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2724/
-----------------------------------------------------------

(Updated 2012-01-13 05:23:36.248155)


Review request for Sqoop.


Summary
-------

Patch for SQOOP-390
https://issues.apache.org/jira/browse/SQOOP-390


This addresses bug SQOOP-390.
    https://issues.apache.org/jira/browse/SQOOP-390


Diffs
-----

  /src/java/com/cloudera/sqoop/manager/PGBulkloadManager.java PRE-CREATION 
  /src/java/com/cloudera/sqoop/mapreduce/AutoProgressReducer.java PRE-CREATION 
  /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportJob.java PRE-CREATION 
  /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportMapper.java PRE-CREATION 
  /src/java/com/cloudera/sqoop/mapreduce/PGBulkloadExportReducer.java PRE-CREATION 
  /src/test/com/cloudera/sqoop/manager/PGBulkloadManagerTest.java PRE-CREATION 

Diff: https://reviews.apache.org/r/2724/diff


Testing (updated)
-------

This patch includes the test class PGBulkloadManagerTest.
I've run "ant test" and it passed.


Thanks,

Masatake


                