Posted to dev@trafodion.apache.org by Eric Owhadi <er...@esgyn.com> on 2015/07/22 00:49:20 UTC

optimization of deleteColumns?

In my Trafodion code learning journey, I stumbled on this code, which I
believe could be optimized if I understand it right:
As I read it, it scans the specified column of a table in order to collect
all the row IDs to pass to deleteRow, along with the column.
The problem is that the scanner is opened without the "KeyOnlyFilter"
filter, which has the nice property of stripping out the values, which we
never use anyway. Using KeyOnlyFilter would also let us bump up
numReqRows, since buffers would no longer be wasted shipping values that
we don't need. The code would have to be changed to allow optionally
passing a KeyOnlyFilter through the JNI layer. Am I reading this right?
If yes, should I open a JIRA on this?

Best regards,
Eric Owhadi

code included for convenience:

Int32 ExpHbaseInterface_JNI::deleteColumns(
         HbaseStr &tblName,
         HbaseStr &column)
{
  Int32 retcode = 0;

  LIST(HbaseStr) columns(heap_);
  columns.insert(column);
  htc_ = client_->getHTableClient((NAHeap *)heap_, tblName.val, useTRex_,
                                  hbs_);
  if (htc_ == NULL)
  {
    retCode_ = HBC_ERROR_GET_HTC_EXCEPTION;
    return HBASE_OPEN_ERROR;
  }
  Int64 transID = getTransactionIDFromContext();

  int numReqRows = 100;
  retcode = htc_->startScan(transID, "", "", columns, -1, FALSE,
                            numReqRows, FALSE,
                            NULL, NULL, NULL, NULL);
  if (retcode != HTC_OK)
    return retcode;

  NABoolean done = FALSE;
  HbaseStr rowID;
  do {
     // Added the for loop to consider using deleteRows
     // to delete the column for all rows in the batch
     for (int rowNo = 0; rowNo < numReqRows; rowNo++)
     {
        retcode = htc_->nextRow();
        if (retcode != HTC_OK)
        {
           done = TRUE;
           break;
        }
        retcode = htc_->getRowID(rowID);
        if (retcode != HBASE_ACCESS_SUCCESS)
        {
           done = TRUE;
           break;
        }
        retcode = htc_->deleteRow(transID, rowID, &columns, -1);
        if (retcode != HTC_OK)
        {
           done = TRUE;
           break;
        }
     }
  } while (!(done || isParentQueryCanceled()));
  scanClose();
  if (retcode == HTC_DONE)
     return HBASE_ACCESS_SUCCESS;
  return HBASE_ACCESS_ERROR;
}

Re: optimization of deleteColumns?

Posted by Dave Birdsall <da...@esgyn.com>.
Hi,

And, yes, once you decide to move forward, please open a JIRA.

Dave


Re: optimization of deleteColumns?

Posted by Selva Govindarajan <se...@esgyn.com>.
I was referring to the drop column scenario. In Trafodion, we delete all
cells for a given rowid except in the drop column scenario; hence
deleteRows does not take a columns parameter. deleteRow has the column
parameter to take care of the drop column scenario. So, if we choose any
of the three options mentioned, we can remove the column parameter from
deleteRow and introduce the needed new methods.

Yes. At least for this case it is possible to create multiple threads to
scan and delete in parallel. However, the HTable/RMInterface object is not
thread safe, so we might need to create as many HTable/RMInterface objects
as there are threads, and ensure the operation remains transactional too.



-- 
- cheers
selvag

RE: optimization of deleteColumns?

Posted by Eric Owhadi <er...@esgyn.com>.
I'm not sure I understand: all this to improve the drop column scenario,
which we have considered unimportant?
Or are you thinking of another delete scenario?

If we want to optimize further with a parallel plan, I don't think the
optimizer is needed. We could use the same mechanism I am planning for
ParallelScan, which multi-threads by region and load-balances across
region servers: just alter parallelScan to issue a deleteRows on each
thread, which ensures that the rows deleted in one multi-row delete all
come from the same region.

I agree that a coprocessor would be the fastest method, but is it worth
going that route given the limited scenario?

Eric


RE: optimization of deleteColumns?

Posted by Selva Govindarajan <se...@esgyn.com>.
Please give consideration to these options to improve the performance of
this scenario:

1) Move the implementation to Java, to
	- reduce JNI-to-Java transitions
	- enable multi-row deletes
2) Use a co-processor to delete.
3) Introduce a SQL command like DELETE <column_name> FROM <table_name> and
teach the optimizer to do a parallel plan and use a rowset to delete the
column values.

Selva




Re: optimization of deleteColumns?

Posted by Eric Owhadi <er...@esgyn.com>.
Actually, looking at the code further, I believe there is an even more
important possible improvement (I would guess at least 10 times more
important than the KeyOnlyFilter trick):
The code loops and issues a single delete per row instead of doing batch
deletes, because the existing deleteRows does not take columns as a
parameter. But we could alter it to add one. That would make for a
cleaner API and allow this optimization. I understand that these ALTER
operations are not frequently used, but I can imagine that a DBA doing
schema changes on a database with millions of records might not
appreciate it if it takes too long to drop a column.
Should we improve this? Should I create a JIRA even if we decide not to
work on it, to document the potential improvement? If you think it is
worth it, I can assign it to myself as a learning exercise, to see if I
can go through the full process.
Eric


On Tue, Jul 21, 2015 at 11:39 PM, Anoop Sharma <an...@esgyn.com>
wrote:

> Yes, Selva is right. This code is used to delete the specified
> column from all rows of a table if that column exists.
> This is done as part of 'alter table drop column' command.
>
> The specified column is removed from metadata and then from the
> table. For correctness of just the drop command, one can remove that column
> from metadata and not remove it from the actual hbase table.
> This would work since referencing that column in a query will return an
> error
> during compile time and one will never reach the point of selecting it
> from the table.
> However, if a column is later added with the same name, then incorrect
> results will
> be returned due to existing column values that were not deleted during the
> drop command.
>
> anoop
>
> -----Original Message-----
> From: Selva Govindarajan [mailto:selva.govindarajan@esgyn.com]
> Sent: Tuesday, July 21, 2015 8:47 PM
> To: dev@trafodion.incubator.apache.org
> Subject: RE: optimization of deleteColumns?
>
> Hi Eric,
>
> I believe this code is used to drop a column from all rows in the
> Trafodion table to support the ALTER TABLE .. DROP COLUMN command. Yes, it is
> possible to optimize the code further. Drop column is rarely used, and hence
> I guess this part of the code didn’t come under the radar for improvement.
>
>
> -----Original Message-----
> From: Eric Owhadi [mailto:eric.owhadi@esgyn.com]
> Sent: Tuesday, July 21, 2015 3:49 PM
> To: dev@trafodion.incubator.apache.org
> Subject: optimization of deleteColumns?
>
> In my Trafodion code learning journey, I stumbled on this code that I
> believe could be optimized if I understand it right:
> I read that it is scanning the specific column of a table in order to find
> all the RowId to pass to deleteRow, along with column.
> The problem is that we are using a scanner without using the
> "KeyOnlyFilter" filter, that would have the nice property of removing the
> value that we don't use anyway. Using KeyOnlyFilter would allow to bump up
> the numReqRows too, since buffers are not wasted passing along values that
> we don't need. The code would have to be changed to allow optional passing
> of KeyOnlyFilter via the JNI layer. Am I reading right?
> If yes, should I open a Jira on this?
>
> Best regards,
> Eric Owhadi
>
> code included for convenience:
>
> Int32 ExpHbaseInterface_JNI::deleteColumns(
>          HbaseStr &tblName,
>          HbaseStr& column)
> {
>   Int32 retcode = 0;
>
>   LIST(HbaseStr) columns(heap_);
>   columns.insert(column);
>   htc_ = client_->getHTableClient((NAHeap *)heap_, tblName.val, useTRex_,
> hbs_);
>   if (htc_ == NULL)
>   {
>     retCode_ = HBC_ERROR_GET_HTC_EXCEPTION;
>     return HBASE_OPEN_ERROR;
>   }
>   Int64 transID = getTransactionIDFromContext();
>
>   int numReqRows = 100;
>   retcode = htc_->startScan(transID, "", "", columns, -1, FALSE,
> numReqRows, FALSE,
>        NULL, NULL, NULL, NULL);
>   if (retcode != HTC_OK)
>     return retcode;
>
>   NABoolean done = FALSE;
>   HbaseStr rowID;
>   do {
>      // Added the for loop to consider using deleteRows
>      // to delete the column for all rows in the batch
>      for (int rowNo = 0; rowNo < numReqRows; rowNo++)
>      {
>          retcode = htc_->nextRow();
>          if (retcode != HTC_OK)
>          {
>             done = TRUE;
>             break;
>          }
>          retcode = htc_->getRowID(rowID);
>          if (retcode != HBASE_ACCESS_SUCCESS)
>          {
>             done = TRUE;
>             break;
>          }
>          retcode = htc_->deleteRow(transID, rowID, &columns, -1);
>          if (retcode != HTC_OK)
>          {
>             done = TRUE;
>             break;
>          }
>      }
>   } while (!(done || isParentQueryCanceled()));
>   scanClose();
>   if (retcode == HTC_DONE)
>      return HBASE_ACCESS_SUCCESS;
>   return HBASE_ACCESS_ERROR;
> }
>
>
>

RE: optimization of deleteColumns?

Posted by Anoop Sharma <an...@esgyn.com>.
Yes, Selva is right. This code is used to delete the specified
column from all rows of a table if that column exists.
This is done as part of 'alter table drop column' command.

The specified column is removed from metadata and then from the
table. For correctness of just the drop command, one could remove that column
from metadata and not remove it from the actual HBase table.
This would work since referencing that column in a query will return an error
at compile time, so one would never reach the point of selecting it from the table.
However, if a column is later added with the same name, incorrect results will
be returned due to existing column values that were not deleted during the drop command.

anoop

RE: optimization of deleteColumns?

Posted by Selva Govindarajan <se...@esgyn.com>.
Hi Eric,

I believe this code is used to drop a column from all rows in the Trafodion table to support the ALTER TABLE .. DROP COLUMN command. Yes, it is possible to optimize the code further. Drop column is rarely used, and hence I guess this part of the code didn’t come under the radar for improvement.

