Posted to user@hive.apache.org by Hamza Asad <ha...@gmail.com> on 2013/07/17 19:40:57 UTC

which approach is better

Please let me know which approach is better: should I save my data directly
to HDFS and run Hive (Shark) queries over it, or store my data in HBase and
query it there? I want efficient data retrieval, and I want the data to
remain safe and easy to recover if Hadoop crashes.

-- 
*Muhammad Hamza Asad*

Re: which approach is better

Posted by Bennie Schut <bs...@ebuddy.com>.

The best way to restore is from a backup. We use distcp to keep this
scalable: http://hadoop.apache.org/docs/r1.2.0/distcp2.html
The data we feed to HDFS also gets pushed to this backup, and the Hive
metadata database (the metastore) gets pushed there as well. This
combination works well for us (we had to use it once).
Even if a namenode could never crash and all software worked fine 100%
of the time, there is always the one crazy user/admin who will find a way
to wipe all data.
To me, backups are not optional.
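
As a rough illustration of the distcp-based backup described above, the
following Python sketch builds the argument list for an incremental copy
between two clusters. The cluster addresses, paths, and the helper name
build_distcp_command are hypothetical and not taken from this thread; only
the -update and -m options are standard distcp flags. Treat it as a sketch
under those assumptions rather than the setup actually in use here.

```python
import shlex
from typing import List


def build_distcp_command(source: str, target: str, max_maps: int = 20) -> List[str]:
    """Build the argument list for an incremental `hadoop distcp` copy.

    -update copies only files that are missing or differ at the target.
    -m caps the number of simultaneous map tasks used for the copy.
    """
    if max_maps < 1:
        raise ValueError("max_maps must be at least 1")
    return ["hadoop", "distcp", "-update", "-m", str(max_maps), source, target]


# Hypothetical cluster addresses and paths, for illustration only.
command = build_distcp_command(
    "hdfs://prod-nn:8020/data/events",
    "hdfs://backup-nn:8020/backup/events",
)
print(shlex.join(command))
```

On a machine with the Hadoop client installed, the resulting list could be
passed to subprocess.run(command, check=True), for example from a nightly
scheduled job.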

On 17-7-2013 20:17, Hamza Asad wrote:
> I use the data to generate reports and run a couple of analyses on a
> daily basis; it is insert-once, read-many. But my main purpose is to
> secure my data and recover it easily even if my Hadoop datanode or
> HDFS crashes. Up till now, I have been retrieving data directly from
> HDFS. A few days back my Hadoop crashed, and after I repaired it I
> was unable to recover my old data that resided on HDFS. So please let
> me know whether I have to make an architectural change, or whether
> there is any way to recover data that resides in a crashed HDFS.

Re: which approach is better

Posted by Hamza Asad <ha...@gmail.com>.

I use the data to generate reports and run a couple of analyses on a daily
basis; it is insert-once, read-many. But my main purpose is to secure my
data and recover it easily even if my Hadoop datanode or HDFS crashes. Up
till now, I have been retrieving data directly from HDFS. A few days back
my Hadoop crashed, and after I repaired it I was unable to recover my old
data that resided on HDFS. So please let me know whether I have to make an
architectural change, or whether there is any way to recover data that
resides in a crashed HDFS.

On Wed, Jul 17, 2013 at 11:00 PM, Nitin Pawar <ni...@gmail.com> wrote:

> What's the purpose of the data storage?
> What read and write throughput do you expect?
> How will you access the data when reading?
> What are your SLAs on both reads and writes?
>
> Others will ask more questions, so be ready for that :)

-- 
*Muhammad Hamza Asad*

Re: which approach is better

Posted by Nitin Pawar <ni...@gmail.com>.

What's the purpose of the data storage?
What read and write throughput do you expect?
How will you access the data when reading?
What are your SLAs on both reads and writes?

Others will ask more questions, so be ready for that :)

On Wed, Jul 17, 2013 at 11:10 PM, Hamza Asad <ha...@gmail.com> wrote:

> Please let me know which approach is better: should I save my data directly
> to HDFS and run Hive (Shark) queries over it, or store my data in HBase and
> query it there? I want efficient data retrieval, and I want the data to
> remain safe and easy to recover if Hadoop crashes.
>
> --
> *Muhammad Hamza Asad*
>

-- 
Nitin Pawar

Re: which approach is better

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.

First of all, that might not be the right way to choose the underlying
storage. You should choose HDFS or HBase depending on whether the data is
going to be used for batch processing or you need random access on top of
it. HBase is just another layer on top of HDFS, so batch queries running
through HBase are generally going to be less efficient than queries that
read HDFS directly. If you can get away with using HDFS, I would say that
is the best and simplest approach.
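
To make the batch versus random-access distinction concrete, here is a
small in-memory analogy in Python. It does not use any HDFS, Hive, or HBase
API; the record layout, row keys, and function names are all hypothetical.
The first function reads every record once to produce a daily report, which
is the access pattern files on HDFS serve well. The second keeps records
addressable by key, which is the kind of lookup HBase is meant for.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical event records: (row_key, day, amount).
Record = Tuple[str, str, int]

records: List[Record] = [
    ("user1#2013-07-16", "2013-07-16", 10),
    ("user2#2013-07-16", "2013-07-16", 5),
    ("user1#2013-07-17", "2013-07-17", 7),
]


def daily_totals(rows: List[Record]) -> Dict[str, int]:
    """Batch-style access: read every record once and aggregate per day."""
    totals: Dict[str, int] = defaultdict(int)
    for _key, day, amount in rows:
        totals[day] += amount
    return dict(totals)


def build_index(rows: List[Record]) -> Dict[str, Record]:
    """Random-access style: keep records addressable by row key."""
    return {row[0]: row for row in rows}


print(daily_totals(records))
index = build_index(records)
print(index["user1#2013-07-17"])
```

For an insert-once, read-many reporting workload like the one described in
this thread, only the first pattern is needed, which is why plain HDFS
storage is usually sufficient.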

On Wed, Jul 17, 2013 at 12:40 PM, Hamza Asad <ha...@gmail.com> wrote:

> Please let me know which approach is better: should I save my data directly
> to HDFS and run Hive (Shark) queries over it, or store my data in HBase and
> query it there? I want efficient data retrieval, and I want the data to
> remain safe and easy to recover if Hadoop crashes.
>
> --
> *Muhammad Hamza Asad*
>

-- 
Swarnim