Posted to user@hive.apache.org by Kyle B <kb...@gmail.com> on 2013/03/05 19:45:29 UTC

Hive sample test

Hello,

I was wondering if there is a way to quick-verify a Hive query before it is
run against a big dataset? The tables I am querying against have millions
of records, and I'd like to verify my Hive query before I run it against
all records.

Is there a way to test the query against a small subset of the data,
without going into full MapReduce? As silly as this sounds, is there a way
to MapReduce without the overhead of MapReduce? That way I can check my
query is doing what I want before I run it against all records.

Thanks,

-Kyle

RE: Hive sample test

Posted by "Connell, Chuck" <Ch...@nuance.com>.
Using the Hive sampling feature would also help. This is exactly what that feature is designed for.
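For reference, Hive's sampling syntax looks roughly like this, a sketch using the placeholder names from elsewhere in the thread; TABLESAMPLE only avoids a full scan when the table is actually CLUSTERED BY the sampled column:

```sql
-- Sample one bucket out of 32; assumes the table was created
-- CLUSTERED BY (id) INTO 32 BUCKETS
select really_expensive_select_clause
from really_big_table tablesample (bucket 1 out of 32 on id)
where something = something;
```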

Chuck



Re: Hive sample test

Posted by Ramki Palle <ra...@gmail.com>.
If none of the 100 rows that the sub-query returns satisfies the where
clause, there will be no rows in the overall result. Do we still consider
the Hive query verified in that case?
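One way to keep the check meaningful is to push the where clause into the
sub-query, so the limited sample is drawn only from rows that already match
(a sketch, reusing Mark's placeholder names):

```sql
select really_expensive_select_clause
from (
  select *
  from really_big_table
  where something = something
  limit 100
) t
group by something = something;
```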

Regards,
Ramki.



Re: Hive sample test

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
Nice, yeah, that would do it.




-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Re: Hive sample test

Posted by Mark Grover <gr...@gmail.com>.
I typically change my query to query from a limited version of the whole table.

Change

select really_expensive_select_clause
from really_big_table
where something = something
group by something = something

to

select really_expensive_select_clause
from (
  select *
  from really_big_table
  limit 100
) t
where something = something
group by something = something



Re: Hive sample test

Posted by Dean Wampler <de...@thinkbiganalytics.com>.
Unfortunately, it will still go through the whole thing, then just limit
the output. However, there's a flag that I think only works in more recent
Hive releases:

set hive.limit.optimize.enable=true

This is supposed to apply the limit earlier in the data stream, so it will
give different results than limiting just the output.

Like Chuck said, you might consider sampling, but unless your table is
organized into buckets, you'll still scan the whole table, though it may
avoid doing all of the computation over it.

Also, if you have a small sample data set:

set hive.exec.mode.local.auto=true

will cause Hive to bypass the Job and Task Trackers, calling APIs directly,
when it can do the whole thing in a single process. Not "lightning fast",
but faster.
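
Putting the two flags together, a trial session might look like this
(a sketch; whether each flag is honored depends on the Hive release, and
the table and column names are the thread's placeholders):

```sql
-- Try the limit optimization and automatic local mode for small inputs
set hive.limit.optimize.enable=true;
set hive.exec.mode.local.auto=true;

select really_expensive_select_clause
from really_big_table
where something = something
limit 100;
```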

dean



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

Re: Hive sample test

Posted by Joey D'Antoni <jd...@yahoo.com>.
Just add a limit 1 to the end of your query.
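
With the placeholder names used elsewhere in the thread, that looks like
the sketch below. Note the limit is applied to the output, so this mostly
checks that the query parses and produces the expected columns:

```sql
select really_expensive_select_clause
from really_big_table
where something = something
limit 1;
```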



