Posted to user@hbase.apache.org by Shaikh Ahmed <rn...@gmail.com> on 2014/05/23 23:49:55 UTC

Copy some records from a huge HBase table to another table

Hi,

We have a huge HBase table with billions of rows. It holds information
about websites and the number of hits each site receives in every
15-minute interval.

Every website has multiple records, each with a different hit count and
last-updated timestamp.

Now, we want to create another HBase table that contains information
about only the top 10 websites with the highest hit counts.

We are seeking help from experts to meet this requirement. How can we
filter the top 10 websites by hit count from billions of records and copy
them into our new HBase table?

I would greatly appreciate any support from the group.

Thanks in advance.
Regards,
Riyaz

Re: Copy some records from a huge HBase table to another table

Posted by Shaikh Ahmed <rn...@gmail.com>.
Thanks James,

I will investigate Apache Phoenix and get back to the group with an update.


Regards,
Riyaz


Re: Copy some records from a huge HBase table to another table

Posted by James Taylor <jt...@salesforce.com>.
Hi Riyaz,
You can do this with a single SQL command using Apache Phoenix, a SQL
engine on top of HBase, and you'll get better performance than if you
hand-coded it using the HBase client APIs. Depending on your current
schema, you may be able to run this command with no change to your data.
Let's assume you have an MD5 hash of the website and the date/time in the
row key, with the website URL and hit counts stored in key values. That
schema could be modeled like this in Phoenix:

CREATE VIEW WEBSITE_STATS (
    WEBSITE_MD5 BINARY(16) NOT NULL,
    DATE_COLLECTED UNSIGNED_DATE NOT NULL,
    WEBSITE_URL VARCHAR,
    HIT_COUNT UNSIGNED_LONG,
    CONSTRAINT PK PRIMARY KEY (WEBSITE_MD5, DATE_COLLECTED));

You could issue this CREATE VIEW statement to map directly onto your
existing HBase table. I used the UNSIGNED types above because they match
the serialization you get when you use the HBase Bytes utility methods.
Phoenix normalizes column names by upper-casing them, so if your column
qualifiers are lower case, you'd want to put the column names above in
double quotes.
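
For instance, if your qualifiers were lower case and sat in a named column
family, you'd quote them. A sketch of the same view, where "stats" is a
made-up column family name (substitute your actual family and qualifiers):

CREATE VIEW WEBSITE_STATS (
    WEBSITE_MD5 BINARY(16) NOT NULL,
    DATE_COLLECTED UNSIGNED_DATE NOT NULL,
    "stats"."website_url" VARCHAR,
    "stats"."hit_count" UNSIGNED_LONG,
    CONSTRAINT PK PRIMARY KEY (WEBSITE_MD5, DATE_COLLECTED));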

Next, you'd create a table to hold the top-10 information:

CREATE TABLE WEBSITE_TOP10 (
    WEBSITE_URL VARCHAR PRIMARY KEY,
    TOTAL_HIT_COUNT BIGINT);

Then you'd run an UPSERT SELECT command like this:

UPSERT INTO WEBSITE_TOP10
SELECT WEBSITE_URL, SUM(HIT_COUNT) FROM WEBSITE_STATS
GROUP BY WEBSITE_URL
ORDER BY SUM(HIT_COUNT) DESC -- descending, so the highest totals come first
LIMIT 10;

Phoenix will run the SELECT part of this in parallel on the client and use
a coprocessor on the server side to aggregate over the WEBSITE_URL values,
returning the distinct set of URLs per region, with a final merge sort
happening on the client to compute the total sums. The client will then
hold on to the top 10 rows it sees and upsert them into the WEBSITE_TOP10
table.
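
As a quick sanity check afterwards, you could read the results back with a
plain query against the two tables defined above:

SELECT WEBSITE_URL, TOTAL_HIT_COUNT
FROM WEBSITE_TOP10
ORDER BY TOTAL_HIT_COUNT DESC;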

HTH,
James

Re: Copy some records from a huge HBase table to another table

Posted by Ted Yu <yu...@gmail.com>.
Would the new HBase table reside in the same cluster as the original table?

See this recent thread: http://search-hadoop.com/m/DHED4uBNqJ1
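
For reference, HBase also ships a CopyTable MapReduce job that can copy a
table within the same cluster or to a remote one. A rough sketch, where
the table names and the ZooKeeper quorum are placeholders:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --new.name=NEW_TABLE \
    --peer.adr=zk1,zk2,zk3:2181:/hbase \
    ORIGINAL_TABLE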

Cheers

