Posted to user@accumulo.apache.org by Keith Turner <ke...@deenlo.com> on 2015/10/01 17:09:00 UTC
Re: Scan vs Filter performance.

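I think you are missing the last one because next() calls super.next() at
the end AND your hasTop() calls super.hasTop().  When next() copies the last
key into emitKey it also advances the source past the end, so hasTop()
returns false before that key is ever read.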
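
One possible way to fix it is to answer hasTop() from the buffered entry rather than from the source. The following is only a minimal sketch of that look-ahead pattern, written against a simplified stand-in and not the real Accumulo classes: Source, ListSource, LookAhead, drain and the accept predicate are all hypothetical names made up for illustration, and skipping rejected rows is an assumption about the intended behavior.

```scala
object LookAheadDemo {
  // Hypothetical, simplified stand-in for a sorted source; NOT the Accumulo API.
  trait Source {
    def hasTop: Boolean
    def top: String
    def next(): Unit
  }

  // In-memory source over a fixed sequence of rows.
  final class ListSource(rows: Seq[String]) extends Source {
    private var i = 0
    def hasTop: Boolean = i < rows.length
    def top: String = rows(i)
    def next(): Unit = i += 1
  }

  // Look-ahead wrapper: buffers the next accepted row in `emit` and then
  // advances the source, so the source is always one step ahead of `emit`.
  final class LookAhead(src: Source, accept: String => Boolean) extends Source {
    private var emit: Option[String] = None
    advance() // load the first accepted row, as seek() would

    private def advance(): Unit = {
      emit = None
      while (emit.isEmpty && src.hasTop) {
        val row = src.top
        src.next()
        if (accept(row)) emit = Some(row)
      }
    }

    // Answered from the buffer, not from the source: the last row is still
    // buffered here after the source itself has run out.
    def hasTop: Boolean = emit.isDefined
    def top: String = emit.get
    def next(): Unit = advance()
  }

  // Reads an iterator the way a client would: hasTop, top, next.
  def drain(it: Source): List[String] = {
    val out = List.newBuilder[String]
    while (it.hasTop) { out += it.top; it.next() }
    out.result()
  }
}
```

Mapped back to the iterator in question, this would roughly mean clearing emitKey when nothing is left to emit and having hasTop() return whether emitKey is set, instead of delegating to super.hasTop().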

On Tue, Sep 29, 2015 at 3:45 PM, Moises Baly <mo...@spatially.com> wrote:

> Hi there,
>
> I'm writing a custom iterator, which essentially is obtaining a range of
> values using a slightly different way to compare the rows (for keeping in
> range). In one test, it should return every row in Accumulo, but it's
> missing the last one. The most important parts of the code would look like
> this:
>
> class CIterator extends WrappingIterator() {
>   private var emitKey: Key = _
>   private var emitValue: Value = _
>
>   override def deepCopy(env: IteratorEnvironment): SortedKeyValueIterator[Key, Value] = {
>     new CIterator(this, env)
>   }
>
>
>   def this(_this: CIterator, env: IteratorEnvironment) = {
>     this()
>     setSource(_this.getSource.deepCopy(env))
>   }
>
>   override def init(source: SortedKeyValueIterator[Key, Value], options: util.Map[String, String], env: IteratorEnvironment) = {
>     super.init(source, options, env)
>   }
>
>   override def getTopKey(): Key = {
>     emitKey
>   }
>
>   override def getTopValue(): Value = {
>     emitValue
>   }
>
>   override def hasTop(): Boolean = {
>     super.hasTop
>   }
>
>   override def seek(range: Range, columnFamilies: util.Collection[ByteSequence], inclusive: Boolean): Unit = {
>
>     ...
>
>     val seekRange = new Range(partialKeyStart.toString, true, partialKeyEnd.toString, true)
>
>     super.seek(seekRange, columnFamilies, inclusive);
>
>     if (super.hasTop()) {
>       next();
>     }
>   }
>
>   override def next(): Unit = {
>     ...
>     val lowerBoundCheck = rangeStart.compareTo(nextKey.getRow.toString)
>     val upperBoundCheck = rangeEnd.compareTo(nextKey.getRow.toString)
>     if (lowerBoundCheck <= 0 && upperBoundCheck >= 0){
>       emitKey = new Key(nextKey)
>       emitValue = new Value(nextValue)
>       if (super.hasTop()){
>         super.next()
>       }
>
>     }
>   }
> }
>
>
> So that code, if I have a range that comprises every row, returns every one of them but the last one. A high level call list would look like this:
>
> Seek ->
> Next ->
> hasTop ->
> Top key ->
> Top key ->
> Top value ->
> hasTop ->
> Top key ->
> Top value ->
> Next ->
> hasTop ->
> Top key ->
> Top key ->
> Top value ->
> (print row - value 1) ->
> hasTop ->
> Top key ->
> Top value ->
> Next ->
> hasTop ->
> Top key ->
> Top key ->
> Top value ->
> (print row - value 2) ->
> hasTop ->
> Top key ->
> Top value ->
> Next ->
> hasTop ->
> (print row - value 3) ->
> hasTop ->
>
> I think I'm missing something on the call tree:
>
> 1- Is it normal to have many subsequent topKey() calls after next()?
>
> 2- This is supposed to give me every row (the condition put in place for the range is working), but as you can see, it stops after the last next() call for some reason (maybe something to do with the interface hierarchy?)
>
> 3- In general, what would be a correct approach (execution path) for building a custom iterator? I'm still unsure how the iterator functions (next, seek, getTop...) interact with each other, especially in the way we give back results to clients.
>
> Thank you for your time,
>
>
> Moises
>
>
>
> On Tue, Sep 29, 2015 at 11:16 AM, Keith Turner <ke...@deenlo.com> wrote:
>
>>
>>
>> On Tue, Sep 29, 2015 at 12:59 AM, mohit.kaushik <mohit.kaushik@orkash.com> wrote:
>>
>>> Hi Keith,
>>>
>>> When we fetch a column or column family, it seems it does not seek and
>>> only scans, filtering the key/value pairs. But as you said, if I design a
>>> custom iterator to fetch a column family, it may work faster.
>>>
>>
>> When column families are fetched, Accumulo will seek[1].  It tries to
>> read 10 cells and then seeks.
>>
>> When fetching family and qualifier, two iterators are used.  The
>> ColumnFamilySkippingIterator and ColumnQualifierFilter.  The
>> ColumnQualifierFilter does a scan of all qualifiers within a family [2].
>> The system configures the qualifier filter to have the family skipping iter
>> as a source[3], so it could still seek between families.
>>
>>
>>>
>>> But I want to know what would be the scenario if I define a locality
>>> group for the column family and run the same custom iterator on it, which
>>> both scans and seeks. What would be the impact on performance (gain or loss)?
>>>
>>
>> Like Josh said, it really depends on your situation. It's hard to offer an
>> opinion w/o knowing more about the schema and the queries.
>>
>> Below I expanded on what Josh mentioned.
>>
>> If you have a locality group, it can really help in the case where you
>> have many rows that have a few families.  For example, if you have 10^7 rows
>> in a tablet and only 10^3 have a certain column family that's in a locality
>> group, it can make it very fast to find those 1000 rows.  W/o a locality
>> group, even w/ seeking, you would still be seeking to each row.
>>
>> Conversely, suppose you have 10^2 rows in a tablet, each having many
>> families.  If a column family you are interested in only exists in 10
>> rows, you will still need to seek for each row to find it, but ~100 seeks is
>> not so bad.
>>
>>
>>
>> [1]:
>> https://github.com/apache/accumulo/blob/1.6.3/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java#L65
>> [2]:
>> https://github.com/apache/accumulo/blob/1.6.3/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java#L54
>> [3]:
>> https://github.com/apache/accumulo/blob/1.6.3/server/tserver/src/main/java/org/apache/accumulo/tserver/Tablet.java#L2005
>>
>>
>>>
>>> Thanks
>>> Mohit Kaushik
>>>
>>>
>>> On 09/28/2015 10:49 PM, Moises Baly wrote:
>>>
>>> Hi Keith,
>>>
>>> No I wasn't aware of that. So I'll move forward with the custom
>>> iterator.
>>>
>>> Thank you for your time,
>>>
>>> Moises
>>>
>>> On Mon, Sep 28, 2015 at 12:35 PM, Keith Turner <ke...@deenlo.com> wrote:
>>>
>>>> On Mon, Sep 28, 2015 at 12:19 PM, Moises Baly <mo...@spatially.com>
>>>> wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>> I would like to perform a range scan on a table, tweaking the
>>>>> definition of what goes into a particular key range. One way I can think of
>>>>> is writing a filter on the key, and that would work fine. But I think it
>>>>> would be slow compared to a scan / seek custom iterator. How does the
>>>>> underlying logic work? Does a Filter go through all records, or, since the
>>>>> data is sorted, does it follow the same underlying logic as a scan? Would a
>>>>> custom iterator perform better?
>>>>>
>>>>
>>>> Yes, a filter will read all data.  A custom iterator that seeks may be
>>>> faster.
>>>>
>>>> Are you aware of the following?
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-3961
>>>> https://github.com/apache/accumulo/pull/42
>>>>
>>>>
>>>>>
>>>>> Thank you for your time,
>>>>>
>>>>> Moises
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>