Posted to user@sqoop.apache.org by Nicolas Paris <ni...@riseup.net> on 2018/11/10 16:10:01 UTC
Export PostgreSQL Direct
Hi
I think I have spotted some minor errors in the sqoop documentation:
[1] : in the table, it says direct mode is enabled for postgres only for
import. That's wrong: export is enabled too.
[2] : psql is not needed for either import or export
----------
Now my questions:
I have been able to load data in all formats (including ORC) into
postgresql with sqoop export in **no direct** mode. While robust, it
uses JDBC insert prepared statements and is way too slow, even
parallelized.
I have been able to load data from **csv only** format with sqoop export
in **direct mode**. While very fast (parallel COPY statements!), the
method is not robust when the data contain varchar columns. In
particular, a varchar column may contain **newlines**, and this breaks
the mapper job, since it splits the csv by newlines.
That's too bad, because the *copy* statement can handle *newlined csv*.
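To make the failure mode concrete, here is a minimal sketch (file name and
contents are illustrative, not taken from an actual Hive export) of why
splitting on physical newlines breaks a CSV whose quoted varchar field
spans two lines:

```shell
# A varchar field may legally contain a newline inside CSV quotes.
printf '1,"line one\nline two"\n2,plain\n' > /tmp/demo.csv
# A CSV-aware reader (like COPY in csv mode) sees 2 records, but
# splitting on physical newlines, as the mapper does, sees 3 lines,
# cutting record 1 in half:
wc -l < /tmp/demo.csv   # -> 3
```

The first record is torn into two fragments, neither of which is valid
CSV on its own, which is exactly what a mid-record input split does.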
1) Is there any way to send one whole hdfs file per mapper instead of
splitting them? That would work well.
2) Any plan to allow sqoop export from orc in direct mode ?
Thanks,
[1]: https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#_supported_databases
[2]: https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html#_requirements_2
--
nicolas
Re: Export PostgreSQL Direct
Posted by Nicolas Paris <ni...@riseup.net>.
On Sat, Nov 10, 2018 at 05:10:01PM +0100, Nicolas Paris wrote:
> I have been able to load data from **csv only** format with sqoop export
> in **direct mode**. While very fast (parallel copy statements!), the
> method is not robust in case the data do have varchar columns. In
> particular a varchar column may contain **newlines** and this breaks the
> mapper job, since it splits the csv by newlines.
> That's too bad, because the *copy* statement can handle *newlined csv*.
>
> 1) Is there any way to only send a whole hdfs file per mapper instead of
> splitting them ? That would work well.
Finally I found a workaround. Setting the number of mappers to one (-m 1)
means the data is never split, and COPY handles the multi-line CSV
produced by Hive well.
> sqoop export \
>     --connect "jdbc:postgresql://<postgres_host>/<postgres_db>" \
>     --username <postgres_user> \
>     --password-file file:///home/$USER/.password \
>     --export-dir /apps/hive/warehouse/<my_db>/<my_table_path> \
>     --table <my_hive_table> \
>     --columns "id, text_column" \
>     -m 1 \
>     --direct \
>     --lines-terminated-by '\n' \
>     --fields-terminated-by ',' \
>     --input-null-string "\\\\N" \
>     --input-null-non-string "\\\\N" \
>     -- --schema <my_schema>
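For reference, a rough single-session equivalent of what a direct-mode
mapper boils down to server-side (the schema, table, column, and file
names here are illustrative, not sqoop's actual generated SQL):

```shell
# In csv mode, COPY itself accepts quoted fields that span multiple
# lines, which is why a single unsplit stream loads cleanly.
cat > /tmp/load.sql <<'SQL'
COPY my_schema.my_hive_table (id, text_column)
FROM STDIN WITH (FORMAT csv, DELIMITER ',', NULL '\N');
SQL
# Against a live server one would pipe the exported file into it, e.g.:
# psql -h <postgres_host> -d <postgres_db> -f /tmp/load.sql < part-m-00000
```

Since COPY parses the CSV itself, the embedded newlines are only a
problem upstream, where the mapper chooses its input splits.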
This should work in any situation (excluding binary columns). In the end,
I am not sure a single COPY is worse than parallel COPYs: it is less
aggressive on the database while still giving great performance.
regards,
--
nicolas