Posted to user@hive.apache.org by Daniel Takacs <hu...@hotmail.com> on 2019/01/03 05:51:28 UTC

Re: insert data into hadoop / hive cluster

Thanks, the tools you pointed to were very interesting, but I was hoping to achieve this with as few external dependencies as possible.

I was thinking of running a script like the following. What do you think?

CREATE TABLE IF NOT EXISTS dbname.finaltable(a string);
SET hive.cli.errors.ignore=true;
ALTER TABLE dbname.finaltable ADD COLUMNS (b decimal(38,0));
ALTER TABLE dbname.finaltable ADD COLUMNS (c decimal(38,0));
ALTER TABLE dbname.finaltable ADD COLUMNS (d string);
SET hive.cli.errors.ignore=false;

CREATE TABLE dbname.`527b66e52b534d919581cae3476b8469` (
  a decimal(38,0),
  b decimal(38,0),
  c string,
  d string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA LOCAL INPATH '/home/user/upload_tmp/73a06dcfa9d74bf8a87a9e297e623521/datatoimport.csv'
OVERWRITE INTO TABLE dbname.`527b66e52b534d919581cae3476b8469`;

SET mapreduce.job.queuename=myqueue;

-- Unfortunately the next statement takes a while, since the job gets placed on a queue.
INSERT INTO TABLE dbname.finaltable (a, b, c, d)
SELECT a, b, c, d FROM dbname.`527b66e52b534d919581cae3476b8469`;
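
-- Optional sanity check (an addition to this sketch, not something I have in place):
-- the INSERT above silently writes NULL for any c value that fails the
-- string -> decimal cast, so counting those rows before the DROP catches bad input.
SELECT COUNT(*) AS bad_c_rows
FROM dbname.`527b66e52b534d919581cae3476b8469`
WHERE c IS NOT NULL
  AND CAST(c AS decimal(38,0)) IS NULL;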

DROP TABLE dbname.`527b66e52b534d919581cae3476b8469`;
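
To drive this without extra dependencies, a minimal wrapper could parameterize the staging table and CSV path and feed the script to beeline (just a sketch: the JDBC URL, the upsert.hql file name, and the hivevar names are placeholders):

beeline -u jdbc:hive2://hiveserver:10000 \
  --hivevar staging_table=527b66e52b534d919581cae3476b8469 \
  --hivevar csv_path=/home/user/upload_tmp/73a06dcfa9d74bf8a87a9e297e623521/datatoimport.csv \
  -f upsert.hql

with the statements referencing dbname.`${hivevar:staging_table}` and '${hivevar:csv_path}' in place of the hard-coded values, so one file handles every incoming CSV. One caveat: through beeline, LOAD DATA LOCAL INPATH resolves the path on the HiveServer2 host, not on the client machine.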


________________________________
From: dam6923 <da...@gmail.com>
Sent: Thursday, December 27, 2018 5:08 AM
To: dev@hive.apache.org
Cc: user@hive.apache.org
Subject: Re: insert data into hadoop / hive cluster

Check out an ETL tool such as StreamSets, NiFi, or Pentaho.

On Wed, Dec 26, 2018, 11:55 PM Daniel Takacs <hungarianguy@hotmail.com>
wrote:

> I'm working on an ETL that requires me to import a continuous stream of
> CSVs into a Hadoop/Hive cluster. For now let's assume the CSVs all need to
> end up in the same database.table, but newer CSVs might introduce
> additional columns (hence I want the script to alter the table and add
> those columns as it encounters them).
>
> e.g.
>
> csv1.csv
> a,b
> 1,2
> 2,4
>
> csv2.csv
> a,b,c
> 3,8,0
> 4,10,2
>
> What is the best way to write such an ETL for Hive? Should I run hive with
> -f to execute scripts like:
>
> upsert.hql:
>
> CREATE TABLE IF NOT EXISTS mydbname.testtable (a INT)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> SET hive.cli.errors.ignore=true;
> ALTER TABLE mydbname.testtable ADD COLUMNS (b string);
> SET hive.cli.errors.ignore=false;
> LOAD DATA LOCAL INPATH '/home/pathtodata/testdata.csv'
> INTO TABLE mydbname.testtable;
>
> (The disadvantage is that when LOAD DATA encounters a string that is
> invalid for an integer column, NULL is inserted and I do not get notified.)
>
> Should I do it from beeline?
> Should I write a Pig script?
> Should I write a Java program?
>
> Should I use programs like https://github.com/enahwe/Csv2Hive ?
>
>
> What's the recommended approach here?
>
>