You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Sateesh Karuturi <sa...@gmail.com> on 2016/01/05 07:58:44 UTC

Deleting empty rows from hive table through java

Hello...
Anyone please help me how to delete empty rows from hive table through java?
Thanks in advance

Re: Deleting empty rows from hive table through java

Posted by Vikas Parashar <pa...@gmail.com>.
If data is not huge then please export it into csv. You have to do all the
transformation on csv and point your table on it.
Would you mind telling me how you are loading your data in hive.



Regards,
Vikas Parashar

On Tue, Jan 5, 2016 at 1:46 PM, Sateesh Karuturi <
sateesh.karuturi9@gmail.com> wrote:

> Thank you for your quick response...
> Directly loading the data from webpage to hive
>
> On Tue, Jan 5, 2016 at 1:44 PM, Vikas Parashar <pa...@gmail.com>
> wrote:
>
>> What is the backend of your table?
>> Is it csv, orc or anything else!
>>
>>
>> Regards,
>> Vikas Parashar
>>
>>
>> On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
>> sateesh.karuturi9@gmail.com> wrote:
>>
>>> Hello...
>>> Anyone please help me how to delete empty rows from hive table through
>>> java?
>>> Thanks in advance
>>>
>>
>>
>

Re: Deleting empty rows from hive table through java

Posted by Sateesh Karuturi <sa...@gmail.com>.
Thank you for your quick response...
Directly loading the data from webpage to hive

On Tue, Jan 5, 2016 at 1:44 PM, Vikas Parashar <pa...@gmail.com> wrote:

> What is the backend of your table?
> Is it csv, orc or anything else!
>
>
> Regards,
> Vikas Parashar
>
>
> On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
> sateesh.karuturi9@gmail.com> wrote:
>
>> Hello...
>> Anyone please help me how to delete empty rows from hive table through
>> java?
>> Thanks in advance
>>
>
>

Re: Deleting empty rows from hive table through java

Posted by Vikas Parashar <pa...@gmail.com>.
What is the backend of your table?
Is it csv, orc or anything else!


Regards,
Vikas Parashar


On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
sateesh.karuturi9@gmail.com> wrote:

> Hello...
> Anyone please help me how to delete empty rows from hive table through
> java?
> Thanks in advance
>

RE: Deleting empty rows from hive table through java

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Agreed.

 

Empty rows in any database have no intrinsic value. If we think of ELT, then in theory we need to get the Web data into Hive table including empty rows and then do the clean-up and getting rid of them. This is time consuming and whatever engine we use it is not going to be efficient. I have a text shell that generates a simple table with ID column and a description column as a random text column. Pretty simple code. However, it inserts one normal row followed by a blank row into a hive table that creates itself

 

#!/bin/ksh

function genrandom

{

        l=$1

        [ "$l" == "" ] && l=50

        tr -dc A-Za-z0-9_ < /dev/urandom | head -c ${l} | xargs

}

#

# Main Section

#

fi

FILE_NAME=`basename $0 .ksh`

#

IN_FILE="/var/tmp/test.hql"

[ -f ${IN_FILE} ] && rm -f ${IN_FILE}

LOG_FILE="/var/tmp/test.log"

[ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}

cat >> ${IN_FILE} << !

set hive.exec.dynamic.partition=true;

set hive.exec.max.dynamic.partitions.pernode=1000;

set hive.enforce.bucketing = true;

use test;  -- That is a test database change it to whatever you like

DROP TABLE IF EXISTS txtest;

CREATE TABLE txtest (

          id int

        , description string

)

CLUSTERED BY (id) INTO 256 BUCKETS

STORED AS ORC TBLPROPERTIES('transactional'='true')

;

INSERT INTO TABLE txtest VALUES

!

ROWS=20

integer ROWCOUNT=1

while ((ROWCOUNT <= ROWS))

do

   NEW_UUID=`genrandom 50`  ## generate 50 character random string

   if  ((ROWCOUNT < ROWS))

   then

     COLON=","

   else

     COLON=""

   fi

   echo "(${ROWCOUNT},'${NEW_UUID}') ${COLON}" >> ${IN_FILE}

   if  ((ROWCOUNT < ROWS))

   then

        #

        ## generate an empty line

        #

        echo "('','') ${COLON}" >> ${IN_FILE}

   fi

   ((ROWCOUNT = ROWCOUNT + 1))

done

#

cat >> ${IN_FILE} << !

;

select * from txtest;

!exit

 

Now run that test.hql script against your hive and then try to delete empty rows

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from test.txtest;

INFO  : Status: Finished successfully in 69.32 seconds

+------+--+

| _c0  |

+------+--+

| 39   |

+------+--+

1 row selected (73.368 seconds)

 

0: jdbc:hive2://rhes564:10010/default> select count(1) from test.txtest where id is null;

INFO  :

INFO  : Status: Finished successfully in 7.04 seconds

+------+--+

| _c0  |

+------+--+

| 19   |

+------+--+

1 row selected (7.151 seconds)

 

0: jdbc:hive2://rhes564:10010/default> delete from test.txtest where id is null;

INFO  :

Query Hive on Spark job[2] stages:

INFO  : 5

INFO  : 4

INFO  :

Status: Running (Hive on Spark job[2])

INFO  : Status: Finished successfully in 15.08 seconds

INFO  : Loading data to table test.txtest from hdfs://rhes564:9000/user/hive/warehouse/test.db/txtest/.hive-staging_hive_2016-01-05_16-08-56_808_8837275301497077898-13/-ext-10000

INFO  : Table test.txtest stats: [numFiles=257, numRows=20, totalSize=65884, rawDataSize=0]

No rows affected (15.696 seconds)

 

 

Ok so it took 15.6 seconds to delete those 19 empty rows. One can easily do that by getting the empty lines at OS level before putting data into Hive table!

 

cat test.hql | grep -v "('','')" > tmp.$$

mv -f tmp.$$ test.hql

 

 

So like most things there is really no clear cut approach whether you want to do it outside of Hive or in Hive.

 

HTH

 

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Vikas Parashar [mailto:para.vikas@gmail.com] 
Sent: 05 January 2016 11:40
To: user@hive.apache.org
Subject: Re: Deleting empty rows from hive table through java

 

Well said Mich,

 

I had gone through from the same scenario in which we had done ETL out side the hive. Once the transformation is done then we loaded all data into hive warehouse. I think, that's the best practice, we should follow it.

 

Regards,

Vikas Parashar 

 

On Tue, Jan 5, 2016 at 5:02 PM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:

In would be interesting to do ETL outside of Hive by getting Data from Webpage to an intermediate file, pruning the empty rows and loading the final CSV file into Hive destination table.

 

I am pretty sure this clean up outside of Hive would be faster compared to said thing in Hive

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Mich Talebzadeh [mailto:mich@peridale.co.uk <ma...@peridale.co.uk> ] 
Sent: 05 January 2016 08:55
To: user@hive.apache.org <ma...@hive.apache.org> 
Subject: RE: Deleting empty rows from hive table through java

 

Hi Sateesh,

 

You can do the clean-up in Hive by creating a staging table in Hive, feeding your CSV data there and then inserting data into main table where COL1 is NOT NULL.

 

Alternatively you can create your Hive table as transactional. Although I would say the staging table is better as you will have a full record of your CSV data at any time.

 

You can of course do the pruning of data outside of Hive using a simple shell script with sed and awk (if you are familiar with those tools).

 

cat CSV_FILE | '|sed -e '/^$/d'

 

HTH

 

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Sateesh Karuturi [mailto:sateesh.karuturi9@gmail.com] 
Sent: 05 January 2016 06:59
To: user@hive.apache.org <ma...@hive.apache.org> 
Subject: Deleting empty rows from hive table through java

 

Hello...

Anyone please help me how to delete empty rows from hive table through java?

Thanks in advance

 


Re: Deleting empty rows from hive table through java

Posted by Vikas Parashar <pa...@gmail.com>.
Well said Mich,

I had gone through from the same scenario in which we had done ETL out side
the hive. Once the transformation is done then we loaded all data into hive
warehouse. I think, that's the best practice, we should follow it.

Regards,
Vikas Parashar

On Tue, Jan 5, 2016 at 5:02 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:

> In would be interesting to do ETL outside of Hive by getting Data from
> Webpage to an intermediate file, pruning the empty rows and loading the
> final CSV file into Hive destination table.
>
>
>
> I am pretty sure this clean up outside of Hive would be faster compared to
> said thing in Hive
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Mich Talebzadeh [mailto:mich@peridale.co.uk]
> *Sent:* 05 January 2016 08:55
> *To:* user@hive.apache.org
> *Subject:* RE: Deleting empty rows from hive table through java
>
>
>
> Hi Sateesh,
>
>
>
> You can do the clean-up in Hive by creating a staging table in Hive,
> feeding your CSV data there and then inserting data into main table where
> COL1 is NOT NULL.
>
>
>
> Alternatively you can create your Hive table as transactional. Although I
> would say the staging table is better as you will have a full record of
> your CSV data at any time.
>
>
>
> You can of course do the pruning of data outside of Hive using a simple
> shell script with sed and awk (if you are familiar with those tools).
>
>
>
> cat CSV_FILE | '|sed -e '/^$/d'
>
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Sateesh Karuturi [mailto:sateesh.karuturi9@gmail.com
> <sa...@gmail.com>]
> *Sent:* 05 January 2016 06:59
> *To:* user@hive.apache.org
> *Subject:* Deleting empty rows from hive table through java
>
>
>
> Hello...
>
> Anyone please help me how to delete empty rows from hive table through
> java?
>
> Thanks in advance
>

RE: Deleting empty rows from hive table through java

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
In would be interesting to do ETL outside of Hive by getting Data from Webpage to an intermediate file, pruning the empty rows and loading the final CSV file into Hive destination table.

 

I am pretty sure this clean up outside of Hive would be faster compared to said thing in Hive

 

Dr Mich Talebzadeh

 

LinkedIn   <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 <http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

 <http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Mich Talebzadeh [mailto:mich@peridale.co.uk] 
Sent: 05 January 2016 08:55
To: user@hive.apache.org
Subject: RE: Deleting empty rows from hive table through java

 

Hi Sateesh,

 

You can do the clean-up in Hive by creating a staging table in Hive, feeding your CSV data there and then inserting data into main table where COL1 is NOT NULL.

 

Alternatively you can create your Hive table as transactional. Although I would say the staging table is better as you will have a full record of your CSV data at any time.

 

You can of course do the pruning of data outside of Hive using a simple shell script with sed and awk (if you are familiar with those tools).

 

cat CSV_FILE | '|sed -e '/^$/d'

 

HTH

 

Dr Mich Talebzadeh

 

LinkedIn   <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 <http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

 <http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Sateesh Karuturi [mailto:sateesh.karuturi9@gmail.com] 
Sent: 05 January 2016 06:59
To: user@hive.apache.org <ma...@hive.apache.org> 
Subject: Deleting empty rows from hive table through java

 

Hello...

Anyone please help me how to delete empty rows from hive table through java?

Thanks in advance


RE: Deleting empty rows from hive table through java

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Sateesh,

 

You can do the clean-up in Hive by creating a staging table in Hive, feeding your CSV data there and then inserting data into main table where COL1 is NOT NULL.

 

Alternatively you can create your Hive table as transactional. Although I would say the staging table is better as you will have a full record of your CSV data at any time.

 

You can of course do the pruning of data outside of Hive using a simple shell script with sed and awk (if you are familiar with those tools).

 

cat CSV_FILE | '|sed -e '/^$/d'

 

HTH

 

Dr Mich Talebzadeh

 

LinkedIn   <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 <http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

 <http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Sateesh Karuturi [mailto:sateesh.karuturi9@gmail.com] 
Sent: 05 January 2016 06:59
To: user@hive.apache.org
Subject: Deleting empty rows from hive table through java

 

Hello...

Anyone please help me how to delete empty rows from hive table through java?

Thanks in advance