Posted to user@pig.apache.org by John <jo...@gmail.com> on 2013/09/13 16:31:40 UTC

Sort Order in HBase with Pig/Piglatin in Java

I have created an HBase table in the HBase shell and added some data.
http://hbase.apache.org/book/dm.sort.html says that the data is sorted
first by row key and then by column. So I tried something in the
HBase shell: http://pastebin.com/gLVAX0rJ

Everything looks fine. I got the right order a -> c -> d, as expected.

Now I tried the same with Apache Pig in Java: http://pastebin.com/jdTpj4Fu

I got this result:

(key1,[c#val,d#val,a#val])

So now the order is c -> d -> a. That seems a little odd to me; shouldn't
it be the same as in HBase? It's important for me to get the right order
because I transform the map into a bag afterwards and then join it with
other tables. If both inputs are sorted, I could use a merge join without
sorting these two datasets. So does anyone know how to get a sorted map
(or bag) of the columns?


thanks

Re: Sort Order in HBase with Pig/Piglatin in Java

Posted by Pradeep Gollakota <pr...@gmail.com>.
No problem! In this case, insertion order is the same as natural order, so
I think a LinkedHashMap is probably a better choice for this particular use
case.

Here's a great SO post about the differences between HashMap, TreeMap and
LinkedHashMap.
http://stackoverflow.com/questions/2889777/difference-between-hashmap-linkedhashmap-and-sortedmap-in-java
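
To make the comparison concrete, here is a minimal sketch (not taken from HBaseStorage; the class and method names are made up for illustration) that feeds the same already-sorted columns into the three map types and prints the order each one hands back. It assumes the columns arrive in the order the HBase shell showed them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapOrderDemo {
    // Puts the columns into the given map in the order supplied and
    // returns the order in which the map hands the keys back.
    static List<String> keyOrder(Map<String, String> map, String... columns) {
        for (String column : columns) {
            map.put(column, "val");
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // Assumption: the columns arrive already sorted, as in the shell output.
        String[] fromHBase = {"a", "c", "d"};

        // HashMap: iteration order is unspecified, so nothing is promised here.
        System.out.println("HashMap:       " + keyOrder(new HashMap<>(), fromHBase));
        // LinkedHashMap: iteration follows insertion order.
        System.out.println("LinkedHashMap: " + keyOrder(new LinkedHashMap<>(), fromHBase));
        // TreeMap: iteration follows the natural ordering of the keys.
        System.out.println("TreeMap:       " + keyOrder(new TreeMap<>(), fromHBase));
    }
}
```

With sorted input, LinkedHashMap and TreeMap both return a, c, d; LinkedHashMap simply avoids the comparisons a TreeMap performs on every put.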





Re: Sort Order in HBase with Pig/Piglatin in Java

Posted by Shahab Yunus <sh...@gmail.com>.
"but since hbase returns the values sorted"

You are right. I just wanted to highlight the subtlety that you are
essentially relying on an external mechanism (HBase) for the desired
feature (sorting) rather than on the collection itself (as with TreeMap).
In this case, sorting again is most probably redundant, and you can use
LinkedHashMap to avoid re-sorting.

Regards,
Shahab



Re: Sort Order in HBase with Pig/Piglatin in Java

Posted by John <jo...@gmail.com>.
Yes, that's a good point, Shahab, but since HBase returns the values sorted,
everything should be fine and I can use the LinkedHashMap.

Thanks to both of you!



Re: Sort Order in HBase with Pig/Piglatin in Java

Posted by Shahab Yunus <sh...@gmail.com>.
"But I used
a LinkedHashMap instead. Do you know which is the better choice? TreeMap
or LinkedHashMap?"

From a functionality perspective, the difference is that LinkedHashMap
maintains the order in which items were inserted into the map. So if they
were inserted in the correct order, you are fine, but if for any reason they
were not inserted in the order you want (i.e. some kind of sort order), you
will not get your desired order.

TreeMap, on the other hand, keeps its entries sorted according to the
natural ordering of the keys (or a supplied Comparator). That gives you more
security in terms of what you want.

Regards,
Shahab
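
The difference described above can be illustrated with a small sketch. The class and method names are invented for the example, and it assumes the columns reach the map out of order, which is exactly the case where LinkedHashMap stops being enough. Note that a TreeMap keyed by String sorts with String.compareTo; for plain ASCII qualifiers like these, that matches HBase's byte-wise ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InsertionVersusSortedOrder {
    // Inserts the columns in the given order and returns the map's key order.
    static List<String> keysAfterInserting(Map<String, String> map, List<String> columns) {
        for (String column : columns) {
            map.put(column, "val");
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // Hypothetical out-of-order input: the map cannot rely on its source here.
        List<String> unsorted = Arrays.asList("d", "a", "c");

        // LinkedHashMap keeps whatever order it was given: [d, a, c]
        System.out.println(keysAfterInserting(new LinkedHashMap<>(), unsorted));
        // TreeMap sorts by key no matter how it was filled: [a, c, d]
        System.out.println(keysAfterInserting(new TreeMap<>(), unsorted));
    }
}
```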



Re: Sort Order in HBase with Pig/Piglatin in Java

Posted by John <jo...@gmail.com>.
Hi, thanks for your quick answer! I figured it out by myself, since the
mailing server was down for the last 2 hours?! Btw, I did option 1, but I used
a LinkedHashMap instead. Do you know which is the better choice, TreeMap
or LinkedHashMap?

Anyway thanks :)



Re: Sort Order in HBase with Pig/Piglatin in Java

Posted by Pradeep Gollakota <pr...@gmail.com>.
That's a great observation, John! The problem is that HBaseStorage maps
column families into a HashMap, so the sort ordering is completely lost.

You have two options:

1. Modify HBaseStorage to use a SortedMap data structure (e.g. TreeMap) and
use the modified HBaseStorage (or make it configurable).
2. Since you convert the map to a bag, you can sort the bag in a nested
foreach statement.

I prefer option 1 myself because it would be more performant than option 2.
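
As a rough illustration of the two options, here is a sketch in plain Java. It is not HBaseStorage's actual code: toColumnMap is a hypothetical stand-in for the step that turns one row's cells into a map, with the map type passed in as a factory (the "make it configurable" variant of option 1), and sortedByQualifier is only a Java analogue of what sorting the bag afterwards would do in option 2.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

class ColumnMapSketch {
    // Option 1 (hypothetical): build the column map with a caller-chosen map type.
    static Map<String, String> toColumnMap(List<Map.Entry<String, String>> cellsInScanOrder,
                                           Supplier<Map<String, String>> mapFactory) {
        Map<String, String> columns = mapFactory.get();
        for (Map.Entry<String, String> cell : cellsInScanOrder) {
            columns.put(cell.getKey(), cell.getValue());
        }
        return columns;
    }

    // Option 2 (analogue): leave the map as it is and sort its entries afterwards.
    static List<Map.Entry<String, String>> sortedByQualifier(Map<String, String> columns) {
        List<Map.Entry<String, String>> entries = new ArrayList<>(columns.entrySet());
        entries.sort(Map.Entry.comparingByKey());
        return entries;
    }

    public static void main(String[] args) {
        // Assumption: cells arrive sorted by qualifier, as the HBase shell showed.
        List<Map.Entry<String, String>> cells = new ArrayList<>();
        cells.add(new SimpleEntry<>("a", "val"));
        cells.add(new SimpleEntry<>("c", "val"));
        cells.add(new SimpleEntry<>("d", "val"));

        // Option 1: either map type preserves a, c, d for sorted input.
        System.out.println(toColumnMap(cells, LinkedHashMap::new));
        System.out.println(toColumnMap(cells, TreeMap::new));

        // Option 2: a HashMap loses the order, so the entries are sorted again.
        System.out.println(sortedByQualifier(toColumnMap(cells, HashMap::new)));
    }
}
```

Option 1 pays for ordering once, while the map is built; option 2 adds a separate sort for every row, which is the reason given above for preferring option 1.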

