Posted to user@hive.apache.org by Raymond Xie <xi...@gmail.com> on 2018/07/23 19:47:23 UTC

What's the best practice to create an external Hive table based on a CSV file on HDFS with 618 columns in the header?

We are using Cloudera CDH 5.11

I have seen solutions for small xlsx files with only a handful of columns in
the header; in my case, the CSV file to be loaded into a new Hive table has
618 columns.

   1. Would it be saved as Parquet by default if I upload it (saved as CSV
   first) through HUE -> File Browser? If not, where can I specify the file
   format?

   2. What would be the best way to create an external Impala table based on
   that location? It would be impractical to create the DDL/schema manually
   as there are so many columns.

Thank you very much.


*------------------------------------------------*
*Sincerely yours,*


*Raymond*

Re: What's the best practice to create an external Hive table based on a CSV file on HDFS with 618 columns in the header?

Posted by Furcy Pin <pi...@gmail.com>.
Hello,

To load your data as Parquet, you can:

A. Use Spark:
https://docs.databricks.com/spark/latest/data-sources/read-csv.html and
write it directly as a Parquet table
(df.write.format("parquet").saveAsTable("parquet_table")); see the sketch
after this list.
B. Load it as a CSV file in Hive, and run CREATE TABLE parquet_table
STORED AS PARQUET AS SELECT * FROM csv_table.
C. Use Hue, or any ETL tool (Talend, Dataiku, etc.).

This list is not exhaustive; surely there are other methods out there.
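
For reference, here is a minimal PySpark sketch of options A and B; the HDFS
path and the table names (parquet_table, csv_table, parquet_table_b) are just
placeholders for your own:

    # Option A: read the CSV with its header row and write it out as a
    # Parquet-backed Hive table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # header=true takes the column names from the first line, and
    # inferSchema=true guesses the types, so none of the 618 columns
    # has to be declared by hand.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/raw/wide_file.csv"))

    df.write.format("parquet").saveAsTable("parquet_table")

    # Option B, expressed through Spark SQL, assuming csv_table already
    # exists in the Hive metastore.
    spark.sql("""
        CREATE TABLE parquet_table_b STORED AS PARQUET
        AS SELECT * FROM csv_table
    """)

Since saveAsTable registers the table in the Hive metastore, Impala should
also be able to query it once its metadata is refreshed (INVALIDATE METADATA).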


Now, concerning the fact that your table has 618 columns, there are a few
possibilities:

A. You don't need all of them, and you can select the ones you want while
you transform your CSV table into a Parquet table.
B. You need all of them (after all, if your company gives you a CSV with
618 columns, they are probably using all of them at some point). In that
case, use the Spark solution above to create the table without specifying
the fields, and then run SHOW CREATE TABLE parquet_table to get the DDL of
the new table if you want to keep it (see the sketch after this list).
C. You can try to normalize your data (i.e. split it into multiple tables).
I would recommend that for a traditional RDBMS, but not for Hive.
D. You can group your columns into structs to make them more organized and
easier to select and manage (that's what I would do); a sketch follows below.
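
To make B and D a bit more concrete, here is a small sketch continuing from
the Spark example above; the column names (street, city, zip) are invented
for illustration:

    # B: recover the DDL that Spark generated, so you can keep it in
    # version control instead of typing out 618 columns yourself.
    spark.sql("SHOW CREATE TABLE parquet_table").show(truncate=False)

    # D: group related columns into a struct so the wide table is easier
    # to browse and to select from.
    from pyspark.sql import functions as F

    df_structs = (df
                  .withColumn("address", F.struct("street", "city", "zip"))
                  .drop("street", "city", "zip"))
    df_structs.write.format("parquet").saveAsTable("parquet_table_structs")

You can then query a nested field with SELECT address.city FROM
parquet_table_structs.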

Last but not least, if you prefer writing plain SQL, I would advise that
you train yourself to perform multi-line edits like this:
https://youtu.be/ZJyJqI1UVaA?t=65

I would advise trying Sublime Text for that, but most modern IDEs support
this too.
https://www.youtube.com/watch?v=kyiBiXFbbQs&ab_channel=MattThiessen

Hope this helps,

cheers, and good luck

Furcy
