Posted to mapreduce-user@hadoop.apache.org by max scalf <or...@gmail.com> on 2015/03/08 00:02:52 UTC

sorting in hive -- general

Hello all,

I am new to Hadoop and Hive in general, and I am reading "Hadoop: The
Definitive Guide" by Tom White. On page 504, in the Hive chapter, Tom
says the following with regard to sorting:

*Sorting and Aggregating*
*Sorting data in Hive can be achieved by using a standard ORDER BY clause.
ORDER BY performs a parallel total sort of the input (like that described
in “Total Sort” on page 261). When a globally sorted result is not
required—and in many cases it isn’t—you can use Hive’s nonstandard
extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*


My question is: what exactly does he mean by a "globally sorted result"? If
the SORT BY operation produces a sorted file per reducer, does that mean that
at the end of the sort all the reducer outputs are put back together to give
the correct result?

Re: sorting in hive -- general

Posted by max scalf <or...@gmail.com>.

Thank you very much for the explanation, Alexander.

Re: sorting in hive -- general

Posted by Alexander Pivovarov <ap...@gmail.com>.

1. SORT BY:
Keys are distributed to reducers according to the MR partitioner (controlled
by DISTRIBUTE BY in Hive).

Let's assume the hash partitioner uses the same column as SORT BY and uses
the formula x mod 16 to get the reducer id.

Reducer 0 will have keys
0
16
32

Reducer 1 will have keys
1
17
33

If you merge the output of reducer 0 and reducer 1, you will have
0
16
32
1
17
33

Each reducer's output is sorted, but the merged result is not.

2. ORDER BY will use 1 reducer, and Hive will send all keys to reducer 0.

So ORDER BY in Hive works differently from TeraSort. In the case of TeraSort,
you can merge the output files and get one file with globally sorted data.
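
To make the difference concrete, here is a small Python sketch that simulates the two partitioning schemes in memory. It is not Hive or Hadoop code: the function names, the sample keys, and the single split point of 17 are all made up for illustration, and the range partitioner only imitates the idea behind TeraSort's partitioning.

```python
from collections import defaultdict

NUM_REDUCERS = 16  # matches the "x mod 16" assumption above


def hash_partition(key):
    """Reducer id for a key under the x mod 16 scheme."""
    return key % NUM_REDUCERS


def range_partition(key, boundaries):
    """TeraSort-style idea: reducer i gets the keys below boundaries[i];
    the last reducer gets everything else."""
    for reducer_id, upper in enumerate(boundaries):
        if key < upper:
            return reducer_id
    return len(boundaries)


def run_reducers(keys, partition):
    """Group keys by reducer id, sort inside each reducer, and return the
    per-reducer outputs in reducer-id order."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[partition(key)].append(key)
    return [sorted(buckets[r]) for r in sorted(buckets)]


def concatenate(outputs):
    """Append the reducer output files one after another."""
    return [key for part in outputs for key in part]


keys = [33, 0, 17, 32, 1, 16]

# SORT BY with a hash partitioner: each file is sorted, the whole is not.
sort_by_parts = run_reducers(keys, hash_partition)
print(sort_by_parts)               # [[0, 16, 32], [1, 17, 33]]
print(concatenate(sort_by_parts))  # [0, 16, 32, 1, 17, 33]

# Range partitioning (hypothetical split point 17): the files do not
# overlap, so appending them gives a globally sorted result.
range_parts = run_reducers(keys, lambda k: range_partition(k, [17]))
print(range_parts)                 # [[0, 1, 16], [17, 32, 33]]
print(concatenate(range_parts))    # [0, 1, 16, 17, 32, 33]
```

With hash partitioning the key ranges of the reducers interleave, so no simple concatenation of the files is globally sorted. With range partitioning every key in file i is smaller than every key in file i+1, which is why the files can simply be appended.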




Re: sorting in hive -- general

Posted by max scalf <or...@gmail.com>.

Thank you, Alexander. So is it fair to assume that when SORT BY is used and
multiple files are produced (one per reducer), at the end of it all of them
are put together/merged to get the results back?

And can SORT BY be used without DISTRIBUTE BY and still be expected to give
the same result as ORDER BY?

Re: sorting in hive -- general

Posted by Alexander Pivovarov <ap...@gmail.com>.

A SORT BY query produces multiple independent files.

ORDER BY produces just one file.

Usually SORT BY is used together with DISTRIBUTE BY.
In older Hive versions (0.7) the two might be used to implement a local sort
within a partition,
similar to RANK() OVER (PARTITION BY A ORDER BY B)
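
As a rough illustration of that "local sort within a partition" idea, the Python sketch below imitates DISTRIBUTE BY a SORT BY a, b in memory and then assigns ranks inside each group. Nothing here is Hive code: the helper names and sample rows are hypothetical, and zlib.crc32 merely stands in for whatever hash function the real partitioner uses.

```python
import zlib
from collections import defaultdict
from itertools import groupby


def reducer_for(a, num_reducers):
    """Stand-in for the partitioner: a deterministic hash of column a."""
    return zlib.crc32(a.encode("utf-8")) % num_reducers


def distribute_and_sort(rows, num_reducers):
    """Imitate DISTRIBUTE BY a SORT BY a, b: route rows by a, then sort
    each reducer's rows by (a, b)."""
    reducers = defaultdict(list)
    for a, b in rows:
        reducers[reducer_for(a, num_reducers)].append((a, b))
    return {rid: sorted(part) for rid, part in reducers.items()}


def rank_within_groups(sorted_rows):
    """Walk one reducer's sorted rows and rank b inside each value of a,
    giving tied values the same rank, as RANK() does."""
    ranked = []
    for a, group in groupby(sorted_rows, key=lambda row: row[0]):
        previous_b, rank = object(), 0  # sentinel never equals a real value
        for position, (_, b) in enumerate(group, start=1):
            if b != previous_b:
                rank, previous_b = position, b
            ranked.append((a, b, rank))
    return ranked


rows = [("x", 30), ("y", 5), ("x", 10), ("y", 7), ("x", 10)]
parts = distribute_and_sort(rows, num_reducers=4)

# Collect every reducer's ranked rows; sorted only to make the printout stable.
ranked = sorted(row for part in parts.values() for row in rank_within_groups(part))
print(ranked)
# [('x', 10, 1), ('x', 10, 1), ('x', 30, 3), ('y', 5, 1), ('y', 7, 2)]
```

Because all rows with the same value of a land on the same reducer and arrive sorted, each reducer can rank its groups on its own without ever seeing the other reducers' data.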


Re: sorting in hive -- general

Posted by max scalf <or...@gmail.com>.

Thank you...

On Mon, Mar 9, 2015 at 2:23 AM, r7raul1984@163.com <r7...@163.com>
wrote:

> read this article
> http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
>
>
> then read
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
>
> ------------------------------
> r7raul1984@163.com
>
>
> *From:* max scalf <or...@gmail.com>
> *Date:* 2015-03-08 07:02
> *To:* HDP mailing list <us...@hadoop.apache.org>; Hive Mailing List
> <us...@hive.apache.org>
> *Subject:* sorting in hive -- general
> Hello all,
>
> I am new to Hadoop and Hive in general, and I am reading "Hadoop: The
> Definitive Guide" by Tom White. On page 504, in the Hive chapter, Tom
> says the following with regard to sorting:
>
> *Sorting and Aggregating*
> *Sorting data in Hive can be achieved by using a standard ORDER BY clause.
> ORDER BY performs a parallel total sort of the input (like that described
> in “Total Sort” on page 261). When a globally sorted result is not
> required—and in many cases it isn’t—you can use Hive’s nonstandard
> extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*
>
>
> My question is: what exactly does he mean by a "globally sorted result"?
> If the SORT BY operation produces a sorted file per reducer, does that mean
> that at the end of the sort all the reducer outputs are put back together
> to give the correct result?
>
>
>
>

Re: sorting in hive -- general

Posted by "r7raul1984@163.com" <r7...@163.com>.

Read this article: http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

Then read: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

r7raul1984@163.com

From: max scalf
Date: 2015-03-08 07:02
To: HDP mailing list; Hive Mailing List
Subject: sorting in hive -- general
Hello all,

I am new to Hadoop and Hive in general, and I am reading "Hadoop: The Definitive Guide" by Tom White. On page 504, in the Hive chapter, Tom says the following with regard to sorting:

Sorting and Aggregating
Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in “Total Sort” on page 261). When a globally sorted result is not required—and in many cases it isn’t—you can use Hive’s nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.

My question is: what exactly does he mean by a "globally sorted result"? If the SORT BY operation produces a sorted file per reducer, does that mean that at the end of the sort all the reducer outputs are put back together to give the correct result?
