Posted to user@accumulo.apache.org by BlackJack76 <ju...@gmail.com> on 2014/05/12 15:23:41 UTC

Delete All Data In Table

Besides using the tableOperations to deleteRows or delete the table entirely,
what is the fastest way to delete all data in a table?  I am currently using
a BatchDeleter but it is extremely slow when I have a large amount of data. 
Any better options?

I don't want to use the tableOperations because both the deleteRows and
delete blow away the splits.  I would like to keep the splits in place.

Appreciate your thoughts!




--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/Delete-All-Data-In-Table-tp9748.html
Sent from the Users mailing list archive at Nabble.com.

RE: Delete All Data In Table

Posted by BlackJack76 <ju...@gmail.com>.
Bob,

The results were a great success.  The speed was much faster than using the
BatchDeleter.  Thanks again to Josh for recommending this approach!




RE: Delete All Data In Table

Posted by Bo...@l-3com.com.
When/if you do the speed comparison would you mind sharing the results?


Re: Delete All Data In Table

Posted by BlackJack76 <ju...@gmail.com>.
Josh,

Thanks for the tip!  I was able to delete the data and keep the splits using
the following code:

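    import java.util.EnumSet;

    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.DevNull;
    import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;

    // conn is assumed to be an already-connected Accumulo Connector
    String dataTableName = "dataTable";
    String iteratorName = "devNull";

    if(conn.tableOperations().exists(dataTableName))
    {
      // DevNull returns no entries, so data compacted through it is dropped
      IteratorSetting setting = new IteratorSetting(25, iteratorName, DevNull.class);

      EnumSet<IteratorScope> scopes = EnumSet.noneOf(IteratorScope.class);
      scopes.add(IteratorScope.minc);
      scopes.add(IteratorScope.majc);
      scopes.add(IteratorScope.scan);

      conn.tableOperations().attachIterator(dataTableName, setting, scopes);

      // flush = true, wait = true: block until the full compaction finishes
      conn.tableOperations().compact(dataTableName, null, null, true, true);

      // detach the iterator so that data written afterwards is kept
      conn.tableOperations().removeIterator(dataTableName, iteratorName, scopes);
    }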

I haven't tested out the speed yet to see how it compares to BatchDeleter
but it definitely works.  Thanks for your help!






Re: Delete All Data In Table

Posted by Michael Wall <mj...@gmail.com>.
If you are simply trying to clean up data between unit tests, you should
look at using the MiniAccumuloCluster.  I heard there may be a blog article
coming out soon for testing stuff.  Until then, take a look at the Instamo
Archetype for an example of its usage:
https://git-wip-us.apache.org/repos/asf?p=accumulo-instamo-archetype.git;a=tree

Mike



Re: Delete All Data In Table

Posted by BlackJack76 <ju...@gmail.com>.
David,

Thanks for your response.  I have a variety of unit tests.  For each unit
test, I insert and search for certain data.  I don't want data from the
previous unit test to be present in the table.

The main issue is that I can't delete the table nor can I create a new one
because of my tablet balancer.  If I do, the splits won't be applied
properly.  The table needs to exist and be split properly when Accumulo
starts up.

Thanks again!






Re: Delete All Data In Table

Posted by David Medinets <da...@gmail.com>.
Can you discuss the use case you are trying to resolve? Why delete all
entries from a table instead of creating a new one?



RE: Delete All Data In Table

Posted by Bo...@l-3com.com.
If deleting data is something you do regularly, consider using the AgeOff iterator (or a similar custom iterator based on your data) to keep older data systematically purged during compactions.  This way you don't incur the overhead of one large batch process.
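
To make the age-off idea concrete, here is a toy Java model of the check such a filter applies to each entry during a compaction. It is only a sketch, not Accumulo's own age-off code: the class AgeOffSketch, its methods, and the map of row name to timestamp are all made up for illustration, and the exact comparison a real filter uses may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AgeOffSketch {
  /** Keeps an entry only while it is no older than ttlMillis (hypothetical rule). */
  static boolean accept(long entryTimestamp, long now, long ttlMillis) {
    return now - entryTimestamp <= ttlMillis;
  }

  /** Toy compaction pass: returns only the entries that survive the age-off check. */
  static Map<String, Long> compact(Map<String, Long> entries, long now, long ttlMillis) {
    Map<String, Long> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Long> e : entries.entrySet()) {
      if (accept(e.getValue(), now, ttlMillis)) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Long> entries = new LinkedHashMap<>();
    entries.put("old", 0L);   // written at t = 0
    entries.put("new", 900L); // written at t = 900
    // At t = 1000 with a TTL of 500, only "new" is young enough to keep.
    System.out.println(compact(entries, 1000L, 500L)); // prints {new=900}
  }
}
```

Because the purge rides along with compactions that happen anyway, old entries disappear a little at a time instead of in one large delete.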


Re: Delete All Data In Table

Posted by BlackJack76 <ju...@gmail.com>.
William,

Appreciate the thoughts!

I am only doing this as part of my unit tests on my system.  There are no
other clients writing data to the system while I am running the DevNull
iterator.




Re: Delete All Data In Table

Posted by William Slacum <wi...@accumulo.net>.
I don't think this has been directly said, but setting DevNull or an age-off
iterator at compaction time means you have to read all of your data.
You're doing it in parallel, but you are still evaluating all the data against
the iterator.

Also, are you trying to do this with zero downtime, i.e., will clients write
data during the compaction period? I think that might cause issues if
another compaction is queued while you have DevNull set as the compaction
iterator, and possibly with age-off as well.



Re: Delete All Data In Table

Posted by BlackJack76 <ju...@gmail.com>.
Josh,

Also, I apologize for the second reply but wanted to touch on your last
point.  I am fairly new to the Accumulo community and not really sure how or
what the process is for submitting a patch or updating any documentation.  I
would be happy to contribute but just don't have the know-how.  Thanks
again!




Re: Delete All Data In Table

Posted by Josh Elser <jo...@gmail.com>.
On 5/12/14, 10:12 AM, BlackJack76 wrote:
> I am not familiar with the DevNullIterator.  I will have to look into that.

Ah, misspoke on the class name. Using the shell:

config -t your_table_name -s table.iterator.majc.devnull=21,org.apache.accumulo.core.iterators.DevNull
compact -t your_table_name -w

There's also a `wait` option on the compact method that you can use to 
programmatically compact the table and wait for it to finish.

The DevNull iterator is, obviously, a pun on /dev/null, which just consumes 
all data sent to it. This iterator will never return any data. We use it 
internally for development on Accumulo to benchmark the internals 
without being affected by disk speed. It is an "internal class", so just 
be aware that it might change out from underneath you across versions 
(but it hasn't since Accumulo has been in Apache, so you're probably 
going to be ok :D)
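
As a rough illustration of why this empties the table without touching the splits, here is a toy Java model. It is a sketch under loud assumptions: ToyTable and its fields are invented for this example and are not Accumulo classes, and a real compaction rewrites files through an iterator stack rather than filtering a list in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

final class ToyTable {
  final TreeSet<String> splits = new TreeSet<>(); // split points (table metadata)
  final List<String> rows = new ArrayList<>();    // stored row IDs (table data)

  /** Toy compaction: rewrites the data, keeping only rows the iterator passes on. */
  void compact(Predicate<String> keep) {
    rows.removeIf(keep.negate());
  }

  public static void main(String[] args) {
    ToyTable t = new ToyTable();
    t.splits.add("g");
    t.splits.add("p");
    t.rows.add("apple");
    t.rows.add("kiwi");
    t.rows.add("zebra");

    // A DevNull-like iterator never returns an entry, so nothing is kept.
    Predicate<String> devNull = row -> false;
    t.compact(devNull);

    System.out.println(t.splits + " " + t.rows); // prints [g, p] []
  }
}
```

The compaction only rewrites data, so the split points are never part of what the iterator sees.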

> Also, do you know if I could attach some sort of custom RowFilter to the
> table that would do the trick?

Possibly, but you'd pretty much just be reimplementing what DevNull is 
doing less efficiently because of the RowFilter constraints.

Making an org.apache.accumulo.core.iterators.user.DevNull iterator may 
be useful if you'd like to submit a patch. We could document it better, 
giving some user-facing examples of when such a class would be useful.

Re: Delete All Data In Table

Posted by BlackJack76 <ju...@gmail.com>.
Thanks William and Josh!

Your suggestion of saving and reapplying the splits was something I was
doing previously and it worked great.  However, I wrote a custom tablet
balancer that doesn't balance the tables (long story so I won't bore you
with details).  Instead, I do all my balancing from my client.  Therefore,
the splits need to be in place when Accumulo starts.  If I delete the table
or the rows then I need to restart Accumulo to have them applied which is
not desirable.  Like I said, long story.

I am not familiar with the DevNullIterator.  I will have to look into that.

Also, do you know if I could attach some sort of custom RowFilter to the
table that would do the trick?

Thanks again!






Re: Delete All Data In Table

Posted by Josh Elser <jo...@gmail.com>.
Not really, you enumerated the options pretty thoroughly :)

BatchDeleter is slow like you said due to pulling back all of the data to
the client and issuing deletes from there.

You could get the splits for your table (just in memory or write to disk if
they won't fit) and just re-add the splits after.

You could also try setting the DevNullIterator on the table for major
compaction and then compact it. This is just a little roundabout.

Re: Delete All Data In Table

Posted by William Slacum <wi...@accumulo.net>.
You could save the splits, delete the table, then reapply the splits.
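
A minimal Java sketch of that save/delete/reapply sequence follows. The TableOps interface is a hypothetical stand-in so the example runs without a cluster; its method names are chosen to mirror listSplits, delete, create and addSplits on Accumulo's tableOperations(), but it uses plain strings for split points, and FakeOps is an in-memory fake that exists only to exercise the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class ResplitSketch {
  /** Hypothetical stand-in for the four table operations this approach needs. */
  interface TableOps {
    SortedSet<String> listSplits(String table);
    void delete(String table);
    void create(String table);
    void addSplits(String table, SortedSet<String> splits);
  }

  /** Empties a table by dropping and recreating it, then restores its splits. */
  static void truncateKeepingSplits(TableOps ops, String table) {
    // Copy the splits first: they are gone once the table is deleted.
    SortedSet<String> saved = new TreeSet<>(ops.listSplits(table));
    ops.delete(table);
    ops.create(table);
    if (!saved.isEmpty()) {
      ops.addSplits(table, saved);
    }
  }

  /** In-memory fake, used only to demonstrate the sequence above. */
  static final class FakeOps implements TableOps {
    final Map<String, SortedSet<String>> splits = new HashMap<>();
    final Map<String, SortedSet<String>> rows = new HashMap<>();

    public SortedSet<String> listSplits(String t) { return splits.get(t); }
    public void delete(String t) { splits.remove(t); rows.remove(t); }
    public void create(String t) {
      splits.put(t, new TreeSet<>());
      rows.put(t, new TreeSet<>());
    }
    public void addSplits(String t, SortedSet<String> s) { splits.get(t).addAll(s); }
  }

  public static void main(String[] args) {
    FakeOps ops = new FakeOps();
    ops.create("dataTable");
    ops.addSplits("dataTable", new TreeSet<>(Arrays.asList("g", "p")));
    ops.rows.get("dataTable").add("apple");

    truncateKeepingSplits(ops, "dataTable");
    System.out.println(ops.splits.get("dataTable") + " " + ops.rows.get("dataTable"));
    // prints [g, p] []
  }
}
```

The table briefly does not exist between the delete and the create, so this only suits setups where nothing else depends on the table or its tablet assignments during that window.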

