Posted to user@spark.apache.org by Mich Talebzadeh <mi...@cloudtechnologypartners.co.uk> on 2016/02/17 23:43:54 UTC

Importing csv files into Hive ORC target table

 

Hi, 

We put CSV files compressed with bzip2 into a staging area on HDFS.

In Hive, an external table is created over that directory as below:

DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/table2'
TBLPROPERTIES ("skip.header.line.count"="1");
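Once the .bz2 files are copied into /data/stg/table2, a quick sanity check from spark-shell might look like this (a sketch through a HiveContext; bzip2 is splittable so the files read in parallel, though some Spark versions may not honor skip.header.line.count, in which case the header shows up as a row):

import org.apache.spark.sql.hive.HiveContext

// HiveContext gives spark-shell access to the Hive metastore.
val sqlContext = new HiveContext(sc)
// Row count of the staged files as Hive sees them.
sqlContext.sql("SELECT COUNT(1) FROM stg_t2").show()
// Eyeball a few rows before converting types.
sqlContext.sql("SELECT * FROM stg_t2 LIMIT 3").show()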

We have an ORC table in Hive created as below: 

DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER INT
,PAYMENTDATE timestamp
,NET DECIMAL(20,2)
,VAT DECIMAL(20,2)
,TOTAL DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
; 

Then we insert the data from the external table into the target table, doing some conversion and ignoring empty rows:

INSERT INTO TABLE t2
SELECT
  INVOICENUMBER
  -- SimpleDateFormat pattern: dd/MM/yyyy (DD would be day-of-year and YYYY
  -- week-year); the *1000 relies on Hive reading an integral value cast to
  -- timestamp as milliseconds since the epoch
  , CAST(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy')*1000 AS timestamp)
  -- keep only digits and the decimal point
  , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
  , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
  , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM stg_t2;
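To make the cleanup concrete, here is the same regex applied to one made-up value in Scala (illustrative only; the INSERT above does this per column):

// Hypothetical raw value as exported from the spreadsheet.
val raw = "£1,234.56"
// Keep only digits and the dot, as REGEXP_REPLACE does above.
val cleaned = raw.replaceAll("[^\\d.]", "")   // "1234.56"
val asDecimal = BigDecimal(cleaned)           // 1234.56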

This works OK for now. 

I was wondering whether this could be done using RDD operations in Spark?

var rdd = sc.textFile("/data/stg/table2") 

I can use rdd.count to see the total number of rows and rdd.collect.foreach(println) to print the individual rows.
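For a large file, rather than collecting everything back to the driver, a safer peek is:

// Print only the first few raw lines.
rdd.take(5).foreach(println)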

I would like some ideas on how to do the CAST conversions etc. to clean up the data and store it in the said ORC table.

Thanks 

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


 

Re: Importing csv files into Hive ORC target table

Posted by Alex Dzhagriev <dz...@gmail.com>.
Hi Mich,

You can use DataFrames (http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes) to achieve that.

import org.apache.spark.sql.hive.HiveContext

// HiveContext gives access to Hive tables and the ORC writer.
val sqlContext = new HiveContext(sc)

val rdd = sc.textFile("/data/stg/table2")

//...
// perform your business logic, cleanups, etc., producing resultRdd
//...

sqlContext.createDataFrame(resultRdd).write.orc("..path..")

Please note that resultRdd should contain Products (e.g. instances of case classes).
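For example, a minimal end-to-end sketch (the Invoice case class, the column positions and the naive comma split are illustrative assumptions; if the money columns contain quoted commas, use a CSV parser such as the spark-csv package instead of split):

import java.sql.Timestamp
import java.text.SimpleDateFormat

// Hypothetical Product type mirroring the target table t2.
case class Invoice(invoicenumber: Int, paymentdate: Timestamp,
                   net: BigDecimal, vat: BigDecimal, total: BigDecimal)

// .bz2 files are decompressed transparently by textFile.
val resultRdd = sc.textFile("/data/stg/table2")
  .map(_.split(","))   // naive split: assumes no commas inside quoted fields
  // drop the header, empty lines and anything malformed
  .filter(cols => cols.length == 5 && cols(0).matches("\\d+"))
  .map { cols =>
    // same cleanup as the REGEXP_REPLACE in the Hive INSERT
    def toDecimal(s: String) = BigDecimal(s.replaceAll("[^\\d.]", ""))
    val fmt = new SimpleDateFormat("dd/MM/yyyy")
    Invoice(cols(0).toInt,
            new Timestamp(fmt.parse(cols(1)).getTime),
            toDecimal(cols(2)), toDecimal(cols(3)), toDecimal(cols(4)))
  }

// Either write ORC files directly or insert into the existing Hive table.
sqlContext.createDataFrame(resultRdd).write.orc("..path..")
// sqlContext.createDataFrame(resultRdd).write.insertInto("t2")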

Cheers, Alex.


