You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Sateesh Karuturi <sa...@gmail.com> on 2016/01/05 07:58:44 UTC
Deleting empty rows from hive table through java
Hello...
Anyone please help me how to delete empty rows from hive table through java?
Thanks in advance
Re: Deleting empty rows from hive table through java
Posted by Vikas Parashar <pa...@gmail.com>.
If data is not huge then please export it into csv. You have to do all the
transformation on csv and point your table on it.
Would you mind telling me how you are loading your data in hive.
Regards,
Vikas Parashar
On Tue, Jan 5, 2016 at 1:46 PM, Sateesh Karuturi <
sateesh.karuturi9@gmail.com> wrote:
> Thank you for your quick response...
> Directly loading the data from webpage to hive
>
> On Tue, Jan 5, 2016 at 1:44 PM, Vikas Parashar <pa...@gmail.com>
> wrote:
>
>> What is the backend of your table?
>> Is it csv, orc or anything else!
>>
>>
>> Regards,
>> Vikas Parashar
>>
>>
>> On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
>> sateesh.karuturi9@gmail.com> wrote:
>>
>>> Hello...
>>> Anyone please help me how to delete empty rows from hive table through
>>> java?
>>> Thanks in advance
>>>
>>
>>
>
Re: Deleting empty rows from hive table through java
Posted by Sateesh Karuturi <sa...@gmail.com>.
Thank you for your quick response...
Directly loading the data from webpage to hive
On Tue, Jan 5, 2016 at 1:44 PM, Vikas Parashar <pa...@gmail.com> wrote:
> What is the backend of your table?
> Is it csv, orc or anything else!
>
>
> Regards,
> Vikas Parashar
>
>
> On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
> sateesh.karuturi9@gmail.com> wrote:
>
>> Hello...
>> Anyone please help me how to delete empty rows from hive table through
>> java?
>> Thanks in advance
>>
>
>
Re: Deleting empty rows from hive table through java
Posted by Vikas Parashar <pa...@gmail.com>.
What is the backend of your table?
Is it csv, orc or anything else!
Regards,
Vikas Parashar
On Tue, Jan 5, 2016 at 12:28 PM, Sateesh Karuturi <
sateesh.karuturi9@gmail.com> wrote:
> Hello...
> Anyone please help me how to delete empty rows from hive table through
> java?
> Thanks in advance
>
RE: Deleting empty rows from hive table through java
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Agreed.
Empty rows in any database have no intrinsic value. If we think of ELT, then in theory we need to get the Web data into Hive table including empty rows and then do the clean-up and getting rid of them. This is time consuming and whatever engine we use it is not going to be efficient. I have a text shell that generates a simple table with ID column and a description column as a random text column. Pretty simple code. However, it inserts one normal row followed by a blank row into a hive table that creates itself
#!/bin/ksh
function genrandom
{
l=$1
[ "$l" == "" ] && l=50
tr -dc A-Za-z0-9_ < /dev/urandom | head -c ${l} | xargs
}
#
# Main Section
#
fi
FILE_NAME=`basename $0 .ksh`
#
IN_FILE="/var/tmp/test.hql"
[ -f ${IN_FILE} ] && rm -f ${IN_FILE}
LOG_FILE="/var/tmp/test.log"
[ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}
cat >> ${IN_FILE} << !
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.enforce.bucketing = true;
use test; -- That is a test database change it to whatever you like
DROP TABLE IF EXISTS txtest;
CREATE TABLE txtest (
id int
, description string
)
CLUSTERED BY (id) INTO 256 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true')
;
INSERT INTO TABLE txtest VALUES
!
ROWS=20
integer ROWCOUNT=1
while ((ROWCOUNT <= ROWS))
do
NEW_UUID=`genrandom 50` ## generate 50 character random string
if ((ROWCOUNT < ROWS))
then
COLON=","
else
COLON=""
fi
echo "(${ROWCOUNT},'${NEW_UUID}') ${COLON}" >> ${IN_FILE}
if ((ROWCOUNT < ROWS))
then
#
## generate an empty line
#
echo "('','') ${COLON}" >> ${IN_FILE}
fi
((ROWCOUNT = ROWCOUNT + 1))
done
#
cat >> ${IN_FILE} << !
;
select * from txtest;
!exit
Now run that test.hql script against your hive and then try to delete empty rows
0: jdbc:hive2://rhes564:10010/default> select count(1) from test.txtest;
INFO : Status: Finished successfully in 69.32 seconds
+------+--+
| _c0 |
+------+--+
| 39 |
+------+--+
1 row selected (73.368 seconds)
0: jdbc:hive2://rhes564:10010/default> select count(1) from test.txtest where id is null;
INFO :
INFO : Status: Finished successfully in 7.04 seconds
+------+--+
| _c0 |
+------+--+
| 19 |
+------+--+
1 row selected (7.151 seconds)
0: jdbc:hive2://rhes564:10010/default> delete from test.txtest where id is null;
INFO :
Query Hive on Spark job[2] stages:
INFO : 5
INFO : 4
INFO :
Status: Running (Hive on Spark job[2])
INFO : Status: Finished successfully in 15.08 seconds
INFO : Loading data to table test.txtest from hdfs://rhes564:9000/user/hive/warehouse/test.db/txtest/.hive-staging_hive_2016-01-05_16-08-56_808_8837275301497077898-13/-ext-10000
INFO : Table test.txtest stats: [numFiles=257, numRows=20, totalSize=65884, rawDataSize=0]
No rows affected (15.696 seconds)
Ok so it took 15.6 seconds to delete those 19 empty rows. One can easily do that by getting the empty lines at OS level before putting data into Hive table!
cat test.hql | grep -v "('','')" > tmp.$$
mv -f tmp.$$ test.hql
So like most things there is really no clear cut approach whether you want to do it outside of Hive or in Hive.
HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Vikas Parashar [mailto:para.vikas@gmail.com]
Sent: 05 January 2016 11:40
To: user@hive.apache.org
Subject: Re: Deleting empty rows from hive table through java
Well said Mich,
I had gone through from the same scenario in which we had done ETL out side the hive. Once the transformation is done then we loaded all data into hive warehouse. I think, that's the best practice, we should follow it.
Regards,
Vikas Parashar
On Tue, Jan 5, 2016 at 5:02 PM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:
In would be interesting to do ETL outside of Hive by getting Data from Webpage to an intermediate file, pruning the empty rows and loading the final CSV file into Hive destination table.
I am pretty sure this clean up outside of Hive would be faster compared to said thing in Hive
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Mich Talebzadeh [mailto:mich@peridale.co.uk <ma...@peridale.co.uk> ]
Sent: 05 January 2016 08:55
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: RE: Deleting empty rows from hive table through java
Hi Sateesh,
You can do the clean-up in Hive by creating a staging table in Hive, feeding your CSV data there and then inserting data into main table where COL1 is NOT NULL.
Alternatively you can create your Hive table as transactional. Although I would say the staging table is better as you will have a full record of your CSV data at any time.
You can of course do the pruning of data outside of Hive using a simple shell script with sed and awk (if you are familiar with those tools).
cat CSV_FILE | '|sed -e '/^$/d'
HTH
Dr Mich Talebzadeh
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Sateesh Karuturi [mailto:sateesh.karuturi9@gmail.com]
Sent: 05 January 2016 06:59
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: Deleting empty rows from hive table through java
Hello...
Anyone please help me how to delete empty rows from hive table through java?
Thanks in advance
Re: Deleting empty rows from hive table through java
Posted by Vikas Parashar <pa...@gmail.com>.
Well said Mich,
I had gone through from the same scenario in which we had done ETL out side
the hive. Once the transformation is done then we loaded all data into hive
warehouse. I think, that's the best practice, we should follow it.
Regards,
Vikas Parashar
On Tue, Jan 5, 2016 at 5:02 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
> In would be interesting to do ETL outside of Hive by getting Data from
> Webpage to an intermediate file, pruning the empty rows and loading the
> final CSV file into Hive destination table.
>
>
>
> I am pretty sure this clean up outside of Hive would be faster compared to
> said thing in Hive
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Mich Talebzadeh [mailto:mich@peridale.co.uk]
> *Sent:* 05 January 2016 08:55
> *To:* user@hive.apache.org
> *Subject:* RE: Deleting empty rows from hive table through java
>
>
>
> Hi Sateesh,
>
>
>
> You can do the clean-up in Hive by creating a staging table in Hive,
> feeding your CSV data there and then inserting data into main table where
> COL1 is NOT NULL.
>
>
>
> Alternatively you can create your Hive table as transactional. Although I
> would say the staging table is better as you will have a full record of
> your CSV data at any time.
>
>
>
> You can of course do the pruning of data outside of Hive using a simple
> shell script with sed and awk (if you are familiar with those tools).
>
>
>
> cat CSV_FILE | '|sed -e '/^$/d'
>
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* Sateesh Karuturi [mailto:sateesh.karuturi9@gmail.com
> <sa...@gmail.com>]
> *Sent:* 05 January 2016 06:59
> *To:* user@hive.apache.org
> *Subject:* Deleting empty rows from hive table through java
>
>
>
> Hello...
>
> Anyone please help me how to delete empty rows from hive table through
> java?
>
> Thanks in advance
>
RE: Deleting empty rows from hive table through java
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
In would be interesting to do ETL outside of Hive by getting Data from Webpage to an intermediate file, pruning the empty rows and loading the final CSV file into Hive destination table.
I am pretty sure this clean up outside of Hive would be faster compared to said thing in Hive
Dr Mich Talebzadeh
LinkedIn <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
<http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
<http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: 05 January 2016 08:55
To: user@hive.apache.org
Subject: RE: Deleting empty rows from hive table through java
Hi Sateesh,
You can do the clean-up in Hive by creating a staging table in Hive, feeding your CSV data there and then inserting data into main table where COL1 is NOT NULL.
Alternatively you can create your Hive table as transactional. Although I would say the staging table is better as you will have a full record of your CSV data at any time.
You can of course do the pruning of data outside of Hive using a simple shell script with sed and awk (if you are familiar with those tools).
cat CSV_FILE | '|sed -e '/^$/d'
HTH
Dr Mich Talebzadeh
LinkedIn <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
<http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
<http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Sateesh Karuturi [mailto:sateesh.karuturi9@gmail.com]
Sent: 05 January 2016 06:59
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: Deleting empty rows from hive table through java
Hello...
Anyone please help me how to delete empty rows from hive table through java?
Thanks in advance
RE: Deleting empty rows from hive table through java
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Sateesh,
You can do the clean-up in Hive by creating a staging table in Hive, feeding your CSV data there and then inserting data into main table where COL1 is NOT NULL.
Alternatively you can create your Hive table as transactional. Although I would say the staging table is better as you will have a full record of your CSV data at any time.
You can of course do the pruning of data outside of Hive using a simple shell script with sed and awk (if you are familiar with those tools).
cat CSV_FILE | '|sed -e '/^$/d'
HTH
Dr Mich Talebzadeh
LinkedIn <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
<http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
<http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Sateesh Karuturi [mailto:sateesh.karuturi9@gmail.com]
Sent: 05 January 2016 06:59
To: user@hive.apache.org
Subject: Deleting empty rows from hive table through java
Hello...
Anyone please help me how to delete empty rows from hive table through java?
Thanks in advance