Posted to user@accumulo.apache.org by Corey Nolet <cn...@texeltek.com> on 2013/01/03 23:41:16 UTC

Filter storing state

Hey Guys,

In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a
FilteringIterator that would allow us to drop in several keys/values
associated with a UUID (similar to a document id). The UUID was further
associated with an "index" (or type). The purpose of the TopN table was to
keep the keys/values separated so that they could still be queried back
with cell-level tagging, but when I performed a query for an index, I would
get the last N UUIDs and further be able to query the keys/values for each
of those UUIDs.

This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
provide two FilteringIterators at compaction time to perform data cleanup of
the table, so that any keys/values kept around were guaranteed to be inside
the range of keys being managed by the versioning iterator.

Just to recap, I have the following table structure (a sketch of how both
column types might be written follows the layout below). I also hash the
keys/values and run a filter before the versioning iterator to clean up any
duplicates. There are two types of columns: index and key/value.


Index:

R: index (or "type" of data)
F: '\x00index'
Q: empty
V: uuid\x00hashOfKeys&Values


Key/Value:

R: index (or "type" of data)
F: uuid
Q: key\x00value
V: empty
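
For reference, a minimal sketch of writing one uuid's entries in this layout with the Java client (this is not the actual service code; the connector, the table name "topN", the keys, and the hash value are placeholders, and the 1.4-era createBatchWriter signature is assumed):

import java.util.UUID;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class TopNWriteSketch {
  // Writes one index entry plus one key/value entry, per the layout above.
  public static void writeExample(Connector conn) throws Exception {
    // 1.4-era signature: maxMemory (bytes), maxLatency (ms), write threads.
    BatchWriter writer = conn.createBatchWriter("topN", 1000000L, 1000L, 2);

    String index = "someIndexType";                 // the "type" of data (row)
    String uuid = UUID.randomUUID().toString();
    String hashOfKeysAndValues = "placeholderHash"; // stand-in for the real hash

    Mutation m = new Mutation(new Text(index));

    // Index column: F = '\x00index', Q = empty, V = uuid\x00hashOfKeys&Values
    m.put(new Text("\u0000index"), new Text(""),
        new Value((uuid + "\u0000" + hashOfKeysAndValues).getBytes()));

    // Key/Value column: F = uuid, Q = key\x00value, V = empty
    m.put(new Text(uuid), new Text("someKey" + "\u0000" + "someValue"),
        new Value(new byte[0]));

    writer.addMutation(m);
    writer.close();
  }
}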


The filtering iterator that makes sure any key/value rows are in the index
manages a hashset internally. The index rows purposely sort before the
key/value rows so that the filter can build up the hashset containing the
uuids in the index. As the filter iterates into the key/value rows, it
returns true only if the uuid of the key/value exists in the hashset of
uuids from the index. This worked with older versions of Accumulo, but I'm
now seeing a weird artifact where init() is called on my Filter in the
middle of iterating through an index row.

More specifically, the Filter will iterate through the index rows of a
specific "index" and build up a hashset, then init() is called, which wipes
away the hashset of uuids, and then the filter goes on to iterate through
the key/value rows. Keep in mind, we are talking about maybe 400k entries,
not enough to span more than one tablet.
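
For reference, a minimal sketch of a filter along these lines against the newer Filter base class (this is not the actual class; the class name is made up, and the index column family constant comes from the layout above):

import java.io.IOException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Filter;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

public class TopNIndexFilter extends Filter {

  // Column family of index entries, per the layout above.
  static final String INDEX_CF = "\u0000index";

  // In-memory state: uuids seen in index entries. This is exactly the state
  // that is lost whenever init() is called again mid-scan.
  private final Set<String> uuidsInIndex = new HashSet<String>();

  // Saved in case a deepCopy of the source is needed later.
  private IteratorEnvironment env;

  @Override
  public void init(SortedKeyValueIterator<Key,Value> source,
      Map<String,String> options, IteratorEnvironment env) throws IOException {
    super.init(source, options, env);
    this.env = env;
    uuidsInIndex.clear();
  }

  @Override
  public boolean accept(Key k, Value v) {
    if (k.getColumnFamily().toString().equals(INDEX_CF)) {
      // Index entry: value is uuid\x00hashOfKeys&Values; remember the uuid.
      String val = new String(v.get());
      int sep = val.indexOf('\u0000');
      uuidsInIndex.add(sep < 0 ? val : val.substring(0, sep));
      return true;
    }
    // Key/value entry: column family is the uuid; keep it only if it was indexed.
    return uuidsInIndex.contains(k.getColumnFamily().toString());
  }
}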

Any idea why this may have worked on 1.3.5 but doesn't work any longer? I
know it's got to be a huge no-no to be storing state inside of a filter,
but I haven't had any issues until trying to update my code for the new
version. If I'm doing this completely wrong, any ideas on how to make this
better?


Thanks!


-- 
Corey Nolet
Senior Software Engineer
TexelTek, inc.
[Office] 301.880.7123
[Cell] 410-903-2110

Re: Filter storing state

Posted by Corey Nolet <cn...@texeltek.com>.
Just to try it out, I set the scan-time buffer on the table to 50M; same result. Thankfully, the configuration, reading, and manipulation logic for the table is hidden behind a service API, so nobody should be manually setting locality groups on the table.

I think for now, the best answer may be to use the filter at compaction (returning true as a default if there's no index in the hash set) in hopes that eventually enough data will be sent through a compaction to make it consistent again.
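
If the filter ends up attached for compaction scopes only, the configuration might look roughly like this (a sketch; the priority, iterator name, table name, and the TopNIndexFilter class from the earlier sketch are all placeholders):

import java.util.EnumSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;

public class AttachTopNFilterSketch {
  public static void attach(Connector conn) throws Exception {
    // Priority 15 keeps the filter below the versioning iterator (default 20),
    // matching the "filter before the versioning iterator" setup.
    IteratorSetting setting =
        new IteratorSetting(15, "topNIndexFilter", TopNIndexFilter.class);

    // Attach only for minor and major compactions, not for scans.
    conn.tableOperations().attachIterator("topN", setting,
        EnumSet.of(IteratorScope.minc, IteratorScope.majc));
  }
}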

Meanwhile, I'm going to spend some time researching and working through some other implementations. I'd like to use an iterator to seek through the tablets and reconstruct the keys/values in the order of the indexes managed by the versioning iterator, but that still doesn't solve the compaction problem.
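
One possible direction along those lines, sketched as an addition to the TopNIndexFilter class above (untested; it assumes one "index"/type per row and reuses the uuidsInIndex, INDEX_CF, and env fields from that sketch): rebuild the uuid set every time the filter is seeked, so that reinitialization between scan buffers no longer matters.

// Added to TopNIndexFilter; also needs imports for Collection, Collections,
// Range, ByteSequence, ArrayByteSequence, and org.apache.hadoop.io.Text.
@Override
public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive)
    throws IOException {
  uuidsInIndex.clear();
  if (range.getStartKey() != null) {
    // Assumes the seek range stays within a single row (one "index" per row).
    Text row = range.getStartKey().getRow();

    // Walk just the index column family of that row with a copy of the source.
    SortedKeyValueIterator<Key,Value> indexScan = getSource().deepCopy(env);
    Set<ByteSequence> indexCf =
        Collections.<ByteSequence>singleton(new ArrayByteSequence(INDEX_CF.getBytes()));
    indexScan.seek(new Range(row), indexCf, true);

    while (indexScan.hasTop()) {
      String val = new String(indexScan.getTopValue().get());
      int sep = val.indexOf('\u0000');
      uuidsInIndex.add(sep < 0 ? val : val.substring(0, sep));
      indexScan.next();
    }
  }
  super.seek(range, columnFamilies, inclusive);
}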




On Jan 3, 2013, at 6:10 PM, Keith Turner wrote:

> On Thu, Jan 3, 2013 at 6:08 PM, Corey Nolet <cn...@texeltek.com> wrote:
>> That's funny you bring that up- because I was JUST discussing this as a possibility with a coworker. Compaction is really the phase that I'm concerned with- as the API for loading the data from the TopN currently only allows you to load the last N keys/values for a single index at a time.
>> 
>> Can I guarantee that compaction will pass each row through a single filter?
> 
> Yes and no.  The same iterator instance is used for an entire
> compaction and only inited and seeked once.  However, sometimes
> compactions only process a subset of a tablet's files.  Therefore you
> cannot guarantee you will see all columns in a row; you may only see a
> subset.  Also, if you have locality groups enabled, each locality
> group is compacted separately.


Re: Filter storing state

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Jan 3, 2013 at 6:08 PM, Corey Nolet <cn...@texeltek.com> wrote:
> That's funny you bring that up- because I was JUST discussing this as a possibility with a coworker. Compaction is really the phase that I'm concerned with- as the API for loading the data from the TopN currently only allows you to load the last N keys/values for a single index at a time.
>
> Can I guarantee that compaction will pass each row through a single filter?

Yes and no.  The same iterator instance is used for an entire
compaction and only inited and seeked once.  However, sometimes
compactions only process a subset of a tablet's files.  Therefore you
cannot guarantee you will see all columns in a row; you may only see a
subset.  Also, if you have locality groups enabled, each locality
group is compacted separately.
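
Given that last point, it might be worth double-checking that no locality groups are configured on the table; a quick check, assuming a Connector named conn and a table named "topN" (both placeholders):

// An empty map means only the default locality group is in play.
Map<String,Set<Text>> groups = conn.tableOperations().getLocalityGroups("topN");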


Re: Filter storing state

Posted by Corey Nolet <cn...@texeltek.com>.
That's funny you bring that up, because I was JUST discussing this as a possibility with a coworker. Compaction is really the phase that I'm concerned with, as the API for loading the data from the TopN currently only allows you to load the last N keys/values for a single index at a time.

Can I guarantee that compaction will pass each row through a single filter?




On Jan 3, 2013, at 5:54 PM, Keith Turner wrote:

> Data is read from the iterators into a buffer.  When the buffer fills
> up, the data is sent to the client and the iterators are reinitialized
> to fill up the next buffer.
> 
> The default buffer size was changed from 50M to 1M at some point.
> This is configured via the property table.scan.max.memory
> 
> The lower buffer size will cause iterators to be reinitialized more
> frequently.  Maybe you are seeing this.
> 
> Keith


Re: Filter storing state

Posted by Keith Turner <ke...@deenlo.com>.
Data is read from the iterators into a buffer.  When the buffer fills
up, the data is sent to the client and the iterators are reinitialized
to fill up the next buffer.

The default buffer size was changed from 50M to 1M at some point.
This is configured via the property table.scan.max.memory

The lower buffer size will cause iterators to be reinitialized more
frequently.  Maybe you are seeing this.
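
For what it's worth, bumping that buffer back up for a single table might look like this (a sketch; the table name and the 50M value are just examples, and the same property can be set from the shell's config command):

import org.apache.accumulo.core.client.Connector;

public class ScanBufferSketch {
  public static void raiseScanBuffer(Connector conn) throws Exception {
    // Restores the older 50M scan-time buffer for the "topN" table only.
    conn.tableOperations().setProperty("topN", "table.scan.max.memory", "50M");
  }
}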

Keith


Re: Filter storing state

Posted by Corey Nolet <cn...@texeltek.com>.
John, thanks for the quick response!

Crazy enough, I'm not doing much differently than the VersioningIterator, as it stores the max number of versions it should be returning, right? And that's a scan-time iterator (as well as majc/minc).

I am testing it as a scan-time iterator (set on the table, but using the Accumulo shell to scan). Perhaps I should force a couple of compactions and see what's left afterwards.
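
Forcing a full major compaction over the whole table, to see what the compaction-scoped filter leaves behind, could look like this (a sketch; the table name is a placeholder):

import org.apache.accumulo.core.client.Connector;

public class ForceCompactSketch {
  public static void compactAll(Connector conn) throws Exception {
    // null start/end rows = whole table; flush memory first; wait for completion.
    conn.tableOperations().compact("topN", null, null, true, true);
  }
}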





On Jan 3, 2013, at 5:53 PM, John Vines wrote:

> Are you testing this in scan time or via actual minor/major compactions? I know at scan time, there is no guarantee that the iterator remains intact through the entire scan, and it instead may be reconstructed, causing state to be lost. I don't think this is the case for compaction time iterators, but I'm not positive.


Re: Filter storing state

Posted by John Vines <vi...@apache.org>.
Are you testing this at scan time or via actual minor/major compactions? I
know that at scan time there is no guarantee the iterator remains intact
through the entire scan; it may instead be reconstructed, causing state to
be lost. I don't think this is the case for compaction-time iterators, but
I'm not positive.

