You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by David Capwell <dc...@gmail.com> on 2015/07/31 04:22:12 UTC

External sorted tables

We are trying to create a external table in hive. This data is sorted,
so wanted to tell hive about this. When I do, it complains about
parsing the create.

> CREATE EXTERNAL TABLE IF NOT EXISTS store.testing (
...
. . . . . . . . . . . . . . . . . . .>   timestamp bigint,
...)
. . . . . . . . . . . . . . . . . . .>   SORTED BY (timestamp)
...
. . . . . . . . . . . . . . . . . . .>   LOCATION '/project/db/table'
. . . . . . . . . . . . . . . . . . .> ;
Error: Error while compiling statement: FAILED: ParseException line
1:507 missing EOF at 'SORTED' near ')' (state=42000,code=40000)
2: jdbc:hive2://localhost:10000/store>

What can I do to let hive know that my data is sorted? Every example
online of sorted by is grouped with buckets, but we really don't want
to add bucketing.


Hive version: 0.14.0

Thanks for your help!

RE: External sorted tables

Posted by David Capwell <dc...@gmail.com>.
Thanks, so only optimization for sorted data is on buckets, so without
buckets there isn't really a reason to tell hive the data is sorted.
Thanks for letting me know
On Aug 3, 2015 11:11 AM, "Ryan Harris" <Ry...@zionsbancorp.com> wrote:

> Sort-Merge-Bucket (SMB) joins and MAPJOINs are related, but separate...
>
>
>
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
>
>
>
> I think you should still be able to get map-side joins (if the join table
> is small enough)....
>
>
>
> Nothing here that says sorting is required:
>
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins#LanguageManualJoins-MapJoinRestrictions
>
>
>
> You need sorting (and bucketing) for the SMB join, but you aren't
> specifying bucketing.
>
>
>
> I'd suggest either bucketing the data along with sorting it, or try it
> without 'sorted by' and see if you can execute a mapjoin.
>
>
>
>
>
> *From:* David Capwell [mailto:dcapwell@gmail.com]
> *Sent:* Monday, August 03, 2015 11:59 AM
> *To:* user@hive.apache.org
> *Subject:* RE: External sorted tables
>
>
>
> Mostly wanted to tell hive it's sorted so it could use more efficient
> joins like a map side join.  No other reason
>
> On Aug 3, 2015 10:47 AM, "Ryan Harris" <Ry...@zionsbancorp.com>
> wrote:
>
> Unless you are using bucketing and sampling, there is no benefit (that I
> can think of) to informing hive that the data **is** in fact sorted...
>
>
>
> If there is something specific you are trying to accomplish by specifying
> the sort order of that column, perhaps you can elaborate on that.
> Otherwise, leave out the 'sorted by' statement and you should be fine.
>
>
>
> *From:* David Capwell [mailto:dcapwell@gmail.com]
> *Sent:* Monday, August 03, 2015 10:50 AM
> *To:* user@hive.apache.org
> *Subject:* Re: External sorted tables
>
>
>
> Based off the ddl it is required to have buckets, I was wondering if there
> was a way to get around it?
>
> Thinking as a hack I could try bucket=1, but if there is a better way
> would love to know
>
> On Aug 2, 2015 6:18 PM, "Takahiko Saito" <ty...@gmail.com> wrote:
>
> Hi,
>
>
>
> Is it possible that 'create table sorted by' must have buckets?
>
>
>
> I found the below statements in
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL:
>
>
>
> "The CLUSTERED BY and SORTED BY creation commands do not affect how data
> is inserted into a table – only how it is read. This means that users must
> be careful to insert data correctly by specifying the number of reducers to
> be equal to the number of buckets, and using CLUSTER BY and SORT BY
> commands in their query."
>
>
>
> On Thu, Jul 30, 2015 at 7:22 PM, David Capwell <dc...@gmail.com> wrote:
>
> We are trying to create a external table in hive. This data is sorted,
> so wanted to tell hive about this. When I do, it complains about
> parsing the create.
>
> > CREATE EXTERNAL TABLE IF NOT EXISTS store.testing (
> ...
> . . . . . . . . . . . . . . . . . . .>   timestamp bigint,
> ...)
> . . . . . . . . . . . . . . . . . . .>   SORTED BY (timestamp)
> ...
> . . . . . . . . . . . . . . . . . . .>   LOCATION '/project/db/table'
> . . . . . . . . . . . . . . . . . . .> ;
> Error: Error while compiling statement: FAILED: ParseException line
> 1:507 missing EOF at 'SORTED' near ')' (state=42000,code=40000)
> 2: jdbc:hive2://localhost:10000/store>
>
> What can I do to let hive know that my data is sorted? Every example
> online of sorted by is grouped with buckets, but we really don't want
> to add bucketing.
>
>
> Hive version: 0.14.0
>
> Thanks for your help!
>
>
>
>
>
> --
>
> Takahiko Saito
> ------------------------------
>
> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
> CONFIDENTIAL and may contain information that is privileged and exempt from
> disclosure under applicable law. If you are neither the intended recipient
> nor responsible for delivering the message to the intended recipient,
> please note that any dissemination, distribution, copying or the taking of
> any action in reliance upon the message is strictly prohibited. If you have
> received this communication in error, please notify the sender immediately.
> Thank you.
> ------------------------------
> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
> CONFIDENTIAL and may contain information that is privileged and exempt from
> disclosure under applicable law. If you are neither the intended recipient
> nor responsible for delivering the message to the intended recipient,
> please note that any dissemination, distribution, copying or the taking of
> any action in reliance upon the message is strictly prohibited. If you have
> received this communication in error, please notify the sender immediately.
> Thank you.
>

RE: External sorted tables

Posted by Ryan Harris <Ry...@zionsbancorp.com>.
Sort-Merge-Bucket (SMB) joins and MAPJOINs are related, but separate...

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

I think you should still be able to get map-side joins (if the join table is small enough)....

Nothing here that says sorting is required:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins#LanguageManualJoins-MapJoinRestrictions

You need sorting (and bucketing) for the SMB join, but you aren't specifying bucketing.

I'd suggest either bucketing the data along with sorting it, or try it without 'sorted by' and see if you can execute a mapjoin.


From: David Capwell [mailto:dcapwell@gmail.com]
Sent: Monday, August 03, 2015 11:59 AM
To: user@hive.apache.org
Subject: RE: External sorted tables


Mostly wanted to tell hive it's sorted so it could use more efficient joins like a map side join.  No other reason
On Aug 3, 2015 10:47 AM, "Ryan Harris" <Ry...@zionsbancorp.com>> wrote:
Unless you are using bucketing and sampling, there is no benefit (that I can think of) to informing hive that the data *is* in fact sorted...

If there is something specific you are trying to accomplish by specifying the sort order of that column, perhaps you can elaborate on that.  Otherwise, leave out the 'sorted by' statement and you should be fine.

From: David Capwell [mailto:dcapwell@gmail.com<ma...@gmail.com>]
Sent: Monday, August 03, 2015 10:50 AM
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: External sorted tables


Based off the ddl it is required to have buckets, I was wondering if there was a way to get around it?

Thinking as a hack I could try bucket=1, but if there is a better way would love to know
On Aug 2, 2015 6:18 PM, "Takahiko Saito" <ty...@gmail.com>> wrote:
Hi,

Is it possible that 'create table sorted by' must have buckets?

I found the below statements in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL:

"The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query."

On Thu, Jul 30, 2015 at 7:22 PM, David Capwell <dc...@gmail.com>> wrote:
We are trying to create a external table in hive. This data is sorted,
so wanted to tell hive about this. When I do, it complains about
parsing the create.

> CREATE EXTERNAL TABLE IF NOT EXISTS store.testing (
...
. . . . . . . . . . . . . . . . . . .>   timestamp bigint,
...)
. . . . . . . . . . . . . . . . . . .>   SORTED BY (timestamp)
...
. . . . . . . . . . . . . . . . . . .>   LOCATION '/project/db/table'
. . . . . . . . . . . . . . . . . . .> ;
Error: Error while compiling statement: FAILED: ParseException line
1:507 missing EOF at 'SORTED' near ')' (state=42000,code=40000)
2: jdbc:hive2://localhost:10000/store>

What can I do to let hive know that my data is sorted? Every example
online of sorted by is grouped with buckets, but we really don't want
to add bucketing.


Hive version: 0.14.0

Thanks for your help!



--
Takahiko Saito
________________________________
THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL and may contain information that is privileged and exempt from disclosure under applicable law. If you are neither the intended recipient nor responsible for delivering the message to the intended recipient, please note that any dissemination, distribution, copying or the taking of any action in reliance upon the message is strictly prohibited. If you have received this communication in error, please notify the sender immediately. Thank you.

======================================================================
THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL and may contain information that is privileged and exempt from disclosure under applicable law. If you are neither the intended recipient nor responsible for delivering the message to the intended recipient, please note that any dissemination, distribution, copying or the taking of any action in reliance upon the message is strictly prohibited. If you have received this communication in error, please notify the sender immediately.  Thank you.

RE: External sorted tables

Posted by David Capwell <dc...@gmail.com>.
Mostly wanted to tell hive it's sorted so it could use more efficient joins
like a map side join.  No other reason
On Aug 3, 2015 10:47 AM, "Ryan Harris" <Ry...@zionsbancorp.com> wrote:

> Unless you are using bucketing and sampling, there is no benefit (that I
> can think of) to informing hive that the data **is** in fact sorted...
>
>
>
> If there is something specific you are trying to accomplish by specifying
> the sort order of that column, perhaps you can elaborate on that.
> Otherwise, leave out the 'sorted by' statement and you should be fine.
>
>
>
> *From:* David Capwell [mailto:dcapwell@gmail.com]
> *Sent:* Monday, August 03, 2015 10:50 AM
> *To:* user@hive.apache.org
> *Subject:* Re: External sorted tables
>
>
>
> Based off the ddl it is required to have buckets, I was wondering if there
> was a way to get around it?
>
> Thinking as a hack I could try bucket=1, but if there is a better way
> would love to know
>
> On Aug 2, 2015 6:18 PM, "Takahiko Saito" <ty...@gmail.com> wrote:
>
> Hi,
>
>
>
> Is it possible that 'create table sorted by' must have buckets?
>
>
>
> I found the below statements in
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL:
>
>
>
> "The CLUSTERED BY and SORTED BY creation commands do not affect how data
> is inserted into a table – only how it is read. This means that users must
> be careful to insert data correctly by specifying the number of reducers to
> be equal to the number of buckets, and using CLUSTER BY and SORT BY
> commands in their query."
>
>
>
> On Thu, Jul 30, 2015 at 7:22 PM, David Capwell <dc...@gmail.com> wrote:
>
> We are trying to create a external table in hive. This data is sorted,
> so wanted to tell hive about this. When I do, it complains about
> parsing the create.
>
> > CREATE EXTERNAL TABLE IF NOT EXISTS store.testing (
> ...
> . . . . . . . . . . . . . . . . . . .>   timestamp bigint,
> ...)
> . . . . . . . . . . . . . . . . . . .>   SORTED BY (timestamp)
> ...
> . . . . . . . . . . . . . . . . . . .>   LOCATION '/project/db/table'
> . . . . . . . . . . . . . . . . . . .> ;
> Error: Error while compiling statement: FAILED: ParseException line
> 1:507 missing EOF at 'SORTED' near ')' (state=42000,code=40000)
> 2: jdbc:hive2://localhost:10000/store>
>
> What can I do to let hive know that my data is sorted? Every example
> online of sorted by is grouped with buckets, but we really don't want
> to add bucketing.
>
>
> Hive version: 0.14.0
>
> Thanks for your help!
>
>
>
>
>
> --
>
> Takahiko Saito
> ------------------------------
> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
> CONFIDENTIAL and may contain information that is privileged and exempt from
> disclosure under applicable law. If you are neither the intended recipient
> nor responsible for delivering the message to the intended recipient,
> please note that any dissemination, distribution, copying or the taking of
> any action in reliance upon the message is strictly prohibited. If you have
> received this communication in error, please notify the sender immediately.
> Thank you.
>

RE: External sorted tables

Posted by Ryan Harris <Ry...@zionsbancorp.com>.
Unless you are using bucketing and sampling, there is no benefit (that I can think of) to informing hive that the data *is* in fact sorted...

If there is something specific you are trying to accomplish by specifying the sort order of that column, perhaps you can elaborate on that.  Otherwise, leave out the 'sorted by' statement and you should be fine.

From: David Capwell [mailto:dcapwell@gmail.com]
Sent: Monday, August 03, 2015 10:50 AM
To: user@hive.apache.org
Subject: Re: External sorted tables


Based off the ddl it is required to have buckets, I was wondering if there was a way to get around it?

Thinking as a hack I could try bucket=1, but if there is a better way would love to know
On Aug 2, 2015 6:18 PM, "Takahiko Saito" <ty...@gmail.com>> wrote:
Hi,

Is it possible that 'create table sorted by' must have buckets?

I found the below statements in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL:

"The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query."

On Thu, Jul 30, 2015 at 7:22 PM, David Capwell <dc...@gmail.com>> wrote:
We are trying to create a external table in hive. This data is sorted,
so wanted to tell hive about this. When I do, it complains about
parsing the create.

> CREATE EXTERNAL TABLE IF NOT EXISTS store.testing (
...
. . . . . . . . . . . . . . . . . . .>   timestamp bigint,
...)
. . . . . . . . . . . . . . . . . . .>   SORTED BY (timestamp)
...
. . . . . . . . . . . . . . . . . . .>   LOCATION '/project/db/table'
. . . . . . . . . . . . . . . . . . .> ;
Error: Error while compiling statement: FAILED: ParseException line
1:507 missing EOF at 'SORTED' near ')' (state=42000,code=40000)
2: jdbc:hive2://localhost:10000/store>

What can I do to let hive know that my data is sorted? Every example
online of sorted by is grouped with buckets, but we really don't want
to add bucketing.


Hive version: 0.14.0

Thanks for your help!



--
Takahiko Saito

======================================================================
THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL and may contain information that is privileged and exempt from disclosure under applicable law. If you are neither the intended recipient nor responsible for delivering the message to the intended recipient, please note that any dissemination, distribution, copying or the taking of any action in reliance upon the message is strictly prohibited. If you have received this communication in error, please notify the sender immediately.  Thank you.

Re: External sorted tables

Posted by David Capwell <dc...@gmail.com>.
Based off the ddl it is required to have buckets, I was wondering if there
was a way to get around it?

Thinking as a hack I could try bucket=1, but if there is a better way would
love to know
On Aug 2, 2015 6:18 PM, "Takahiko Saito" <ty...@gmail.com> wrote:

> Hi,
>
> Is it possible that 'create table sorted by' must have buckets?
>
> I found the below statements in
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL:
>
> "The CLUSTERED BY and SORTED BY creation commands do not affect how data
> is inserted into a table – only how it is read. This means that users must
> be careful to insert data correctly by specifying the number of reducers to
> be equal to the number of buckets, and using CLUSTER BY and SORT BY
> commands in their query."
>
> On Thu, Jul 30, 2015 at 7:22 PM, David Capwell <dc...@gmail.com> wrote:
>
>> We are trying to create a external table in hive. This data is sorted,
>> so wanted to tell hive about this. When I do, it complains about
>> parsing the create.
>>
>> > CREATE EXTERNAL TABLE IF NOT EXISTS store.testing (
>> ...
>> . . . . . . . . . . . . . . . . . . .>   timestamp bigint,
>> ...)
>> . . . . . . . . . . . . . . . . . . .>   SORTED BY (timestamp)
>> ...
>> . . . . . . . . . . . . . . . . . . .>   LOCATION '/project/db/table'
>> . . . . . . . . . . . . . . . . . . .> ;
>> Error: Error while compiling statement: FAILED: ParseException line
>> 1:507 missing EOF at 'SORTED' near ')' (state=42000,code=40000)
>> 2: jdbc:hive2://localhost:10000/store>
>>
>> What can I do to let hive know that my data is sorted? Every example
>> online of sorted by is grouped with buckets, but we really don't want
>> to add bucketing.
>>
>>
>> Hive version: 0.14.0
>>
>> Thanks for your help!
>>
>
>
>
> --
> Takahiko Saito
>

Re: External sorted tables

Posted by Takahiko Saito <ty...@gmail.com>.
Hi,

Is it possible that 'create table sorted by' must have buckets?

I found the below statements in
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL:

"The CLUSTERED BY and SORTED BY creation commands do not affect how data is
inserted into a table – only how it is read. This means that users must be
careful to insert data correctly by specifying the number of reducers to be
equal to the number of buckets, and using CLUSTER BY and SORT BY commands
in their query."

On Thu, Jul 30, 2015 at 7:22 PM, David Capwell <dc...@gmail.com> wrote:

> We are trying to create a external table in hive. This data is sorted,
> so wanted to tell hive about this. When I do, it complains about
> parsing the create.
>
> > CREATE EXTERNAL TABLE IF NOT EXISTS store.testing (
> ...
> . . . . . . . . . . . . . . . . . . .>   timestamp bigint,
> ...)
> . . . . . . . . . . . . . . . . . . .>   SORTED BY (timestamp)
> ...
> . . . . . . . . . . . . . . . . . . .>   LOCATION '/project/db/table'
> . . . . . . . . . . . . . . . . . . .> ;
> Error: Error while compiling statement: FAILED: ParseException line
> 1:507 missing EOF at 'SORTED' near ')' (state=42000,code=40000)
> 2: jdbc:hive2://localhost:10000/store>
>
> What can I do to let hive know that my data is sorted? Every example
> online of sorted by is grouped with buckets, but we really don't want
> to add bucketing.
>
>
> Hive version: 0.14.0
>
> Thanks for your help!
>



-- 
Takahiko Saito