Posted to user@phoenix.apache.org by Gaurav Kanade <ga...@gmail.com> on 2015/09/25 23:10:54 UTC

Problems with Phoenix SqlLine loading large amounts of data

Hello Guys

I was able to load my large data set (200 GB) with the Phoenix bulk load tool
with your help last week.

But I am now running into another problem when running queries on this data
using sqlline.

All I am trying to do is run a simple count(*) query.
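
Concretely, the query is just (table name as in the schema below):

    SELECT COUNT(*) FROM AP1.EstimateA33;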

Initially I hit timeout issues due to a socket connection exception; I figured
out a way to get past this by setting hbase.client.retry.count in
hbase-site.xml on the client side.

However, I have never been able to successfully run the count(*), even though
I increased my Phoenix query timeout to 10 minutes (the query eventually
times out).
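
For reference, here is roughly what I have on the client side (a minimal
sketch of my hbase-site.xml; the retry value is illustrative, and
phoenix.query.timeoutMs is the Phoenix query timeout in milliseconds):

    <configuration>
      <!-- client-side retry setting mentioned above (value illustrative) -->
      <property>
        <name>hbase.client.retry.count</name>
        <value>10</value>
      </property>
      <!-- Phoenix query timeout: 10 minutes = 600000 ms -->
      <property>
        <name>phoenix.query.timeoutMs</name>
        <value>600000</value>
      </property>
    </configuration>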

To figure out what was happening, I tried count(*) on smaller chunks of data
by filtering on the first column of the primary key, i.e. WHERE Col1 < x,
WHERE Col1 < y.
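
For example (the cutoff values here are made up; BookId is the leading
primary key column in the schema below):

    SELECT COUNT(*) FROM AP1.EstimateA33 WHERE BookId < 100;
    SELECT COUNT(*) FROM AP1.EstimateA33 WHERE BookId < 200;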

The queries succeed on the smaller chunks, but soon hit a point where they
cannot scale and run into the same timeout issues. What is weird is that the
behavior is also very erratic; for example, a query may time out (10 minutes)
on the first attempt and then complete in 2 minutes on the second attempt,
and so on.

I looked at the region server logs and see no errors that point to anything,
except for an occasional responseTooSlow message on scan requests.

I realize I don't have many specific error messages to provide, but that is
mainly because I couldn't find any. The nature of the overall behavior
suggests I might be missing something obvious at a high level; if so, it
would be great if you could point me in that direction.

The schema for the table is as follows (900 million rows; I am now using a
32-node cluster):

CREATE TABLE AP1.EstimateA33 (
    BookId          INTEGER NOT NULL,
    MarketId        INTEGER NOT NULL,
    StationId       INTEGER NOT NULL,
    EstimateTypeId  INTEGER NOT NULL,
    DaypartId       INTEGER NOT NULL,
    DemographicId   INTEGER NOT NULL,
    EstimateValue   INTEGER,
    StateId         INTEGER,
    StateCode       VARCHAR,
    StateName       VARCHAR,
    EstimateType    VARCHAR,
    MarketName      VARCHAR,
    StationName     VARCHAR,
    DaypartName     VARCHAR,
    BookType        VARCHAR,
    BookYear        INTEGER,
    GeographyId     INTEGER,
    SeasonId        INTEGER,
    BookGeography   VARCHAR,
    BookSeason      VARCHAR,
    DemographicType VARCHAR
    CONSTRAINT pk_Estimates PRIMARY KEY
        (BookId, MarketId, StationId, EstimateTypeId, DaypartId, DemographicId)
) SALT_BUCKETS = 33

-- 
Thanks
Gaurav

Re: Problems with Phoenix SqlLine loading large amounts of data

Posted by Gaurav Kanade <ga...@gmail.com>.
Hey Gabriel
Thanks a lot for the suggestion; it turns out we were not quite using our
Phoenix client (the sqlline program) correctly, i.e. we did not have the
right configs in the right place on the client side. Because of that, I had
tried a bunch of different salt bucket configs without seeing any results.
Once I put the right configs in the right place, we got the desired results
and the behavior made sense.
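
For anyone who hits the same thing: the fix amounted to making sure the
client-side hbase-site.xml is actually picked up by sqlline. A sketch of what
we do now (the conf path and ZooKeeper quorum are placeholders for our setup):

    # point the Phoenix client at the directory containing hbase-site.xml
    export HBASE_CONF_DIR=/etc/hbase/conf    # placeholder path
    ./sqlline.py zk-host1,zk-host2,zk-host3:2181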

Thanks a ton for your help!

Gaurav

Re: Problems with Phoenix SqlLine loading large amounts of data

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Gaurav,

Looking at your DDL statement, I'm guessing that your table is
currently made up of 33 regions, which means that a full count query
will take at least as long as counting 27 million rows with a single
thread (900 million rows divided by 33 regions).

The most likely reason for issues like this is poor parallelization of
the query. Have you tried running "UPDATE STATISTICS" on the table? If
not, please see this link:
http://phoenix.apache.org/update_statistics.html
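
For example (statement form as described on that page; an EXPLAIN afterwards
will show how many parallel scans Phoenix plans for the count):

    UPDATE STATISTICS AP1.EstimateA33;
    EXPLAIN SELECT COUNT(*) FROM AP1.EstimateA33;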

Manually splitting the table will also likely improve the
parallelization of a select count(*) query.
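
That splitting would be done from the HBase shell, along these lines (a
sketch: Phoenix stores the table upper-cased as AP1.ESTIMATEA33 in HBase,
and since the table is salted, any explicit split keys would need to start
with the salt byte):

    split 'AP1.ESTIMATEA33'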

- Gabriel

