Posted to user@hbase.apache.org by Vidhyashankar Venkataraman <vi...@yahoo-inc.com> on 2011/05/18 01:25:24 UTC

A few issues we ran into the last couple of weeks.

(Running HBase 0.90.0 on 700+ nodes.)

You may have seen many (or mostly all) of the following issues already:
   1. HConnection.isTableAvailable: This doesn't seem to be working all the time. In particular, I had this code after creating a table asynchronously:

    do {
      LOG.info("Table " + tableName + " not yet available... Sleeping for " + sleepTime + " milliseconds...");
      Thread.sleep(sleepTime);
    } while (!conn.isTableAvailable(table.getTableName()));
    LOG.info("Table is available!! : " + tableName + " Available? " + conn.isTableAvailable(table.getTableName()));

It comes out of the loop but then I see this:
Table is available!! : <TABLE> Available? false

And then I see that not all the regions are yet available.


   2. The master getting stuck unable to delete a WAL (I have seen this before on this forum, along with a related JIRA): We had worked around it by manually deleting the WAL. But when the master crashed during table creation (with split key boundaries), the node that took over next as the master (failover) started getting stuck for around 25% of the cluster. I had to wipe out all the logs so that the master could start up properly.

But even then, the regionservers that had suffered the log issue couldn't recognize the failed-over master. (Has this been observed before?)


   3. createTableAsync with incorrect split keys: By mistake, I had some duplicate keys in the split key byte array while calling the createTableAsync function. The master crashed throwing a KeeperException (thanks to the duplicate keys I guess?)


Also, can you let me know why createTableAsync blocks for some time and throws a socket timeout exception when I try creating a table with a large number of regions?

Thank you
Vidhya

Re: A few issues we ran into the last couple of weeks.

Posted by Ted Yu <yu...@gmail.com>.

For 3, we have the following check in createTable():
        if (lastKey != null && Bytes.equals(splitKey, lastKey)) {
          throw new IllegalArgumentException("All split keys must be unique, " +
            "found duplicate: " + Bytes.toStringBinary(splitKey) +

I wonder why you used createTableAsync() directly, which doesn't have the
split key check.
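
Until such a check exists on the asynchronous path, a caller could validate the split keys before sending the request. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical and not part of HBase, and the unsigned lexicographic ordering (Java 9+) is chosen only for illustration.

```java
import java.util.Arrays;

// Hypothetical client-side helper, not part of HBase: rejects duplicate
// split keys before a create-table request is sent.
final class SplitKeyValidator {
    private SplitKeyValidator() {}

    /** Returns a sorted copy of splitKeys, or throws if two keys are byte-for-byte equal. */
    static byte[][] sortedUnique(byte[][] splitKeys) {
        // Shallow copy: the caller's array order is left untouched.
        byte[][] sorted = splitKeys.clone();
        // Unsigned lexicographic order; an assumption made for this sketch.
        Arrays.sort(sorted, (a, b) -> Arrays.compareUnsigned(a, b));
        // After sorting, any duplicates are adjacent.
        for (int i = 1; i < sorted.length; i++) {
            if (Arrays.equals(sorted[i - 1], sorted[i])) {
                throw new IllegalArgumentException(
                    "All split keys must be unique, found duplicate: "
                        + Arrays.toString(sorted[i]));
            }
        }
        return sorted;
    }
}
```

Calling sortedUnique(splitKeys) before createTableAsync() would turn a duplicate key into a client-side IllegalArgumentException instead of a request that reaches the master.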

Re: A few issues we ran into the last couple of weeks.

Posted by Ted Yu <yu...@gmail.com>.

Vidhyashankar:
table.getRegionsInfo() is for advanced users (such as you) :-)
Anyway, we shouldn't force users to call it.

Re: A few issues we ran into the last couple of weeks.

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.

Thanks Ted! Will do it right away.

> 1. we should provide the following new API where numOfRegions is the
> expected number of regions to go online:

I used table.getRegionsInfo() to make sure all regions were online instead of this function. But that function requires a priori knowledge of the number of regions.

V
P.S:  Copy-pasting my full name could be a little tedious!


Re: A few issues we ran into the last couple of weeks.

Posted by Ted Yu <yu...@gmail.com>.

Vidhyashankar:
Please file the following JIRAs:
1. We should provide the following new API, where numOfRegions is the
expected number of regions to go online:
    public boolean isTableAvailable(final byte[] tableName, int numOfRegions) throws IOException {

2. HBaseAdmin.createTableAsync() should check whether there are duplicate
keys. Since it is a public method, we shouldn't rely solely on
createTable() to perform the check.
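
The difference between the existing check and the first proposal can be sketched as below. This is only a model of the semantics, not HBase source: the class and method names are hypothetical, and available and regionCount are plain values standing in for the fields in the `available.get() && (regionCount.get() > 0)` line discussed in this thread.

```java
// Hypothetical model of the two availability checks; not HBase source code.
final class TableAvailabilitySketch {
    private TableAvailabilitySketch() {}

    /** A table pre-split with n split keys has n + 1 regions. */
    static int expectedRegions(int splitKeyCount) {
        return splitKeyCount + 1;
    }

    /** Shape of the check quoted in this thread: satisfied by a single region. */
    static boolean looseCheck(boolean available, int regionCount) {
        return available && regionCount > 0;
    }

    /** Proposed shape: satisfied only when the expected number of regions is seen. */
    static boolean strictCheck(boolean available, int regionCount, int numOfRegions) {
        return available && regionCount == numOfRegions;
    }
}
```

With 9 split keys and only one region online, looseCheck(true, 1) is true while strictCheck(true, 1, expectedRegions(9)) is false, which matches the behaviour reported for a partially assigned table.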

Thanks

Re: A few issues we ran into the last couple of weeks.

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.

As in, the use of isTableAvailable there implies that a bulk load should happen only if all the regions are available.

But that may not be the case, since the function returns true if even one region is online (the regionCount.get()>0 check).

V


Re: A few issues we ran into the last couple of weeks.

Posted by Ted Yu <yu...@gmail.com>.

Did you mean that, coming out of the following loop, the table might still be
unavailable if there were many regions?
    while (!conn.isTableAvailable(table.getTableName()) && (ctr<TABLE_CREATE_MAX_RETRIES)) {
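
A bounded retry loop of this shape can be written independently of the condition it polls. The sketch below is a generic helper under stated assumptions: BoundedPoll and waitUntil are hypothetical names, and the condition passed in is whatever check the caller trusts (for example, a comparison of the online region count against the expected count).

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper; not part of HBase.
final class BoundedPoll {
    private BoundedPoll() {}

    /**
     * Polls condition up to maxRetries times, sleeping sleepMillis after each
     * failed attempt, then checks once more. Returns false if the condition
     * never held or if the thread was interrupted while sleeping.
     */
    static boolean waitUntil(BooleanSupplier condition, long sleepMillis, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // Final check after the last sleep.
        return condition.getAsBoolean();
    }
}
```

The helper returns the result of the condition it last evaluated, so the caller does not need a second call after the loop that could disagree with the first.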

Cheers

Re: A few issues we ran into the last couple of weeks.

Posted by Ted Yu <yu...@gmail.com>.

>> Also some of the source for which we had used this function may be broken
>> (for example in LoadIncrementalHFiles.java)
Can you be more specific?

Thanks

Re: A few issues we ran into the last couple of weeks.

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.

>> For 1, the check in HCM.isTableAvailable() is:
>>      return available.get() && (regionCount.get() > 0);
>> This explains why some regions aren't available.

The javadoc says the function returns true if all regions are available. Going by what is in the code, that statement is clearly wrong. Some of the source that uses this function may be broken as well (for example, LoadIncrementalHFiles.java).
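
To illustrate the mismatch, here is a toy model. It is not HBase source: only the return expression comes from this thread, while the way `available` and `regionCount` get populated and the `strictCheck` variant are assumptions made for illustration. Under those assumptions, a check of this shape can report true while a table is still being populated and false again a moment later, which would match the log output reported in the original post.

```java
// Sketch only: a toy model, not HBase source.
class AvailabilitySketch {

    // Same shape as the return expression quoted in this thread.
    // ASSUMPTION: 'assignedFlags' holds one entry per region row that is
    // currently visible in the catalog; true means a server is assigned.
    static boolean quotedStyleCheck(boolean[] assignedFlags) {
        boolean available = true;
        int regionCount = 0;
        for (boolean assigned : assignedFlags) {
            if (!assigned) {
                // An unassigned region was seen: stop and report unavailable.
                available = false;
                break;
            }
            regionCount++;
        }
        return available && (regionCount > 0);
    }

    // Hypothetical stricter check: additionally require that the number of
    // visible regions matches the number the caller asked for.
    static boolean strictCheck(boolean[] assignedFlags, int expectedRegions) {
        return assignedFlags.length == expectedRegions
                && quotedStyleCheck(assignedFlags);
    }

    public static void main(String[] args) {
        // Four regions were requested; only two rows are visible so far.
        boolean[] partial = {true, true};
        // All four rows are visible, one is still unassigned.
        boolean[] withPending = {true, true, false, true};
        // All four rows are visible and assigned.
        boolean[] complete = {true, true, true, true};

        System.out.println(quotedStyleCheck(partial));     // true
        System.out.println(strictCheck(partial, 4));       // false
        System.out.println(quotedStyleCheck(withPending)); // false
        System.out.println(strictCheck(complete, 4));      // true
    }
}
```

A caller that knows how many regions it asked for (split keys plus one) could compare against that count instead of trusting the boolean alone.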

>> For 3, can you provide a unit test so that we can investigate further?

The problem is that I am unable to get the master to crash consistently. I can send you the split keys.

Thank you
Vidhya

On 5/17/11 4:59 PM, "Ted Yu" <yu...@gmail.com> wrote:

For 1, the check in HCM.isTableAvailable() is:
      return available.get() && (regionCount.get() > 0);
This explains why some regions aren't available.

For 3, can you provide a unit test so that we can investigate further?

Thanks


Re: A few issues we ran into the last couple of weeks.

Posted by Ted Yu <yu...@gmail.com>.

For 1, the check in HCM.isTableAvailable() is:
      return available.get() && (regionCount.get() > 0);
This explains why some regions aren't available.

For 3, can you provide a unit test so that we can investigate further?

Thanks


Re: A few issues we ran into the last couple of weeks.

Posted by Stack <st...@duboce.net>.

On Tue, May 17, 2011 at 4:25 PM, Vidhyashankar Venkataraman
<vi...@yahoo-inc.com> wrote:
>   2. The master getting stuck unable to delete a WAL (I have seen this before on this forum and a related JIRA on this one): We had worked around by manually deleting a WAL. But during times when the master crashed during table creation (with split key boundaries), the node that took over next as the master (failover) started getting stuck for around 25% of the cluster. I had to wipe out all the logs so that the master could start up right.
>
> But even then, the regionservers which had suffered the log issue couldn't recognize the failed over master. (Is this something that has been observed before?)
>

Please file an issue with log samples, Vidhya.


>   3. createTableAsync with incorrect split keys: By mistake, I had some duplicate keys in the split key byte array while calling the createTableAsync function. The master crashed throwing a KeeperException (thanks to the duplicate keys I guess?)
>

Do you have the exception, Vidhya? I'd think it would be easy to add a
check of the keys passed in before running the create.
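
A client-side check along those lines might look like the sketch below. It is not part of HBase: the class and method names are made up for illustration, and it only rejects null or duplicate keys, comparing bytes as unsigned values in lexicographic order (assumed here to match how row keys are ordered).

```java
import java.util.Arrays;

// Sketch only: hypothetical helper, not part of the HBase client API.
class SplitKeyCheck {

    // Lexicographic comparison that treats each byte as an unsigned value.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Throws IllegalArgumentException if any key is null or appears twice.
    static void requireDistinct(byte[][] splitKeys) {
        // Work on a copy so the caller's array order is left untouched.
        byte[][] sorted = new byte[splitKeys.length][];
        for (int i = 0; i < splitKeys.length; i++) {
            if (splitKeys[i] == null) {
                throw new IllegalArgumentException("Null split key at index " + i);
            }
            sorted[i] = splitKeys[i];
        }
        Arrays.sort(sorted, SplitKeyCheck::compareUnsigned);
        // After sorting, equal keys are adjacent.
        for (int i = 1; i < sorted.length; i++) {
            if (compareUnsigned(sorted[i - 1], sorted[i]) == 0) {
                throw new IllegalArgumentException(
                        "Duplicate split key: " + Arrays.toString(sorted[i]));
            }
        }
    }

    public static void main(String[] args) {
        byte[][] keys = { {10}, {20}, {10} };
        try {
            requireDistinct(keys);
            System.out.println("keys ok");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Calling something like `requireDistinct(splitKeys)` before createTableAsync would turn the mistake into a client-side exception instead of a request to the master.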

> Also, can you let me know why createTableAsync blocks for some time and throws a socket timeout exception when I try creating a table with a large number of regions?
>

It shouldn't be blocking. It should return. Is this HBASE-3744, which was
fixed in 0.90.3?

St.Ack