Posted to user@spark.apache.org by Alex Dzhagriev <dz...@gmail.com> on 2016/02/17 23:58:07 UTC

Re: Importing csv files into Hive ORC target table

Hi Mich,

You can use DataFrames (
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes)
to achieve that.

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

val rdd = sc.textFile("/data/stg/table2")

// ...
// perform your business logic, cleanups, etc.
// ...

sqlContext.createDataFrame(resultRdd).write.orc("..path..")

Please note that resultRdd should contain Products (e.g. case classes), along the lines of the sketch below.
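
A minimal sketch, assuming five comma-separated columns matching the t2
table described below and dd/MM/yyyy payment dates (the case class and the
helper names are illustrative, not part of any API):

import java.sql.Timestamp
import java.text.SimpleDateFormat

// Column layout assumed from the t2 table below; adjust to the real file.
case class Payment(invoicenumber: Int, paymentdate: Timestamp,
                   net: BigDecimal, vat: BigDecimal, total: BigDecimal)

// Strip currency symbols and thousand separators, keep digits and the dot.
def toDecimal(s: String): BigDecimal = BigDecimal(s.replaceAll("[^\\d.]", ""))

// Parse dd/MM/yyyy dates (SimpleDateFormat pattern letters are case sensitive).
def toTimestamp(s: String): Timestamp =
  new Timestamp(new SimpleDateFormat("dd/MM/yyyy").parse(s).getTime)

val header = rdd.first()
val resultRdd = rdd
  .filter(_ != header)                              // drop the header line
  .map(_.split(","))                                // naive split; quoted fields need a real CSV parser
  .filter(f => f.length == 5 && f(0).trim.nonEmpty) // ignore empty/short rows
  .map(f => Payment(f(0).trim.toInt, toTimestamp(f(1).trim),
                    toDecimal(f(2)), toDecimal(f(3)), toDecimal(f(4))))

sqlContext.createDataFrame(resultRdd).write.orc("..path..")

If the target is the existing Hive table t2 rather than a bare path, you can
also register the DataFrame as a temporary table and run an INSERT ... SELECT
through the HiveContext instead of write.orc.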

Cheers, Alex.



On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh <
mich.talebzadeh@cloudtechnologypartners.co.uk> wrote:

> Hi,
>
> We put CSV files compressed with bzip2 into a staging area on HDFS.
>
> In Hive an external table is created as below:
>
> DROP TABLE IF EXISTS stg_t2;
> CREATE EXTERNAL TABLE stg_t2 (
>  INVOICENUMBER string
> ,PAYMENTDATE string
> ,NET string
> ,VAT string
> ,TOTAL string
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/data/stg/table2'
> TBLPROPERTIES ("skip.header.line.count"="1");
>
> We have an ORC table in Hive created as below:
>
>
>
> DROP TABLE IF EXISTS t2;
> CREATE TABLE t2 (
>  INVOICENUMBER          INT
> ,PAYMENTDATE            timestamp
> ,NET                    DECIMAL(20,2)
> ,VAT                    DECIMAL(20,2)
> ,TOTAL                  DECIMAL(20,2)
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB" )
> ;
>
> Then we insert the data from the external table into the target table, doing
> some conversions and ignoring empty rows:
>
> INSERT INTO TABLE t2
> SELECT
>           INVOICENUMBER
>         , CAST(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy')*1000 as timestamp)
>         --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
>         , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
>         , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
>         , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
> FROM stg_t2;
>
> This works OK for now.
>
>
>
> I was wondering whether this could be done using RDD operations in Spark?
>
> val rdd = sc.textFile("/data/stg/table2")
>
> I can use rdd.count to see the total number of rows and
> rdd.collect.foreach(println) to see the individual rows.
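>
> Since collect() brings the whole dataset back to the driver, a gentler peek
> at a big file is something like this minimal sketch:
>
>   println(rdd.count())           // total number of lines
>   rdd.take(10).foreach(println)  // print only the first 10 lines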
>
>
>
> I would like some ideas on how I can do CAST conversions etc. on the data
> to clean it up and store it in the said ORC table.
>
>
>
> Thanks
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>