Posted to user@accumulo.apache.org by "shweta.agrawal" <sh...@orkash.com> on 2015/06/12 14:47:40 UTC

Abnormal behaviour of custom iterator in getting entries

Hi,

I am writing a custom iterator that returns multiple entries. For some 
entries the getTopValue function is called, but for others it is skipped. 
Because of this, I am not getting all of the entries that should be 
returned at scan time.

I logged the order of the function calls to a text file, which is:
hasTop
getTopKey
hasTop
getTopKey
getTopValue
next
hasTop

Thanks
Shweta
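
One thing the call trace above suggests, although the thread does not confirm it as the cause: hasTop and getTopKey may be called more than once for the same entry, and a getTopValue call does not necessarily follow every getTopKey. An iterator that advances its state inside an accessor could therefore skip entries. A minimal sketch in plain Java, assuming a hypothetical BufferedTopIterator class that only imitates the method names (it is not the Accumulo interface), keeps all advancing in next():

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the hasTop/getTopKey/getTopValue/next contract.
// It is NOT the Accumulo interface, only an imitation of its method names.
class BufferedTopIterator {
    private final Iterator<Map.Entry<String, String>> pending;
    private Map.Entry<String, String> top; // current entry, or null when exhausted

    BufferedTopIterator(TreeMap<String, String> sortedEntries) {
        this.pending = sortedEntries.entrySet().iterator();
        next(); // position on the first entry
    }

    boolean hasTop() { return top != null; }

    // Accessors only read the current entry, so repeated calls are harmless.
    String getTopKey() { return top.getKey(); }
    String getTopValue() { return top.getValue(); }

    // The only method that moves to the following entry.
    void next() { top = pending.hasNext() ? pending.next() : null; }
}

class TopSemanticsDemo {
    // Drains the iterator using the call order from the logged trace:
    // hasTop, getTopKey, hasTop, getTopKey, getTopValue, next.
    static List<String> drain(BufferedTopIterator it) {
        List<String> out = new ArrayList<>();
        while (it.hasTop()) {
            it.getTopKey(); // an extra look at the key, as in the trace
            if (!it.hasTop()) break;
            out.add(it.getTopKey() + "=" + it.getTopValue());
            it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> data = new TreeMap<>();
        data.put("row1", "a");
        data.put("row2", "b");
        data.put("row3", "c");
        System.out.println(drain(new BufferedTopIterator(data))); // [row1=a, row2=b, row3=c]
    }
}
```

Here drain replays the logged call order, and every entry is still returned exactly once because only next() moves the cursor.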



Re: Abnormal behaviour of custom iterator in getting entries

Posted by Josh Elser <jo...@gmail.com>.
Also, apparently I wrote something similar to your problem a long time ago:

https://github.com/joshelser/accumulo-column-summing

The above implementation does assume large contiguous ranges. Thought it 
might be helpful anyways.

Josh Elser wrote:
> Good, I'm glad you found it useful.
>
> The important thing to always remember is that your data is split across
> many tablet servers and that Iterators run local to each tablet server.
> As such, you cannot compute a single sum via an iterator, you can, at
> best, compute N intermediate sums -- one of each tabletserver the
> batchscanner had to talk to.
>
> Also ignore my previous comment about a second iterator. I had assumed
> you were doing something fancier than selecting a single column
> qualifier from a row.
>
> Since you're passing in what are likely multiple, disjoint ranges, I'm
> not sure you're going to get much of a performance optimization out of a
> custom iterator in this case. After each seek, your iterator would need
> to return the entries that it summed in the provided Range (the Iterator
> framework isn't designed to know the overall state of the scan -- you
> might have more data to read or you might be done. You must return the
> data when the data you're reading moves outside of the current range).
>
> The way that you'd see the real optimization an Iterator provides is if
> you are scanning over a large, contiguous set of rows specified by a
> single Range (you can get the reduction of reading many key/values into
> a single pair returned).
>
> If I mis-stated your situation, please do let me know.
>
> madhvi wrote:
>> Hi,
>>
>> Thanks for the blog you shared.I found it quite useful for my
>> requirement.
>> "How are you passing these IDs to the batch scanner?"
>> I am passing row ids received as a previous query result from another
>> table as 'new Range(entry.getKey().getRow())' in a Range type list and
>> passing that list to batch Scanner.
>>
>> "Are you trying to sum across all rows that you queried? "
>> Yes we need to sum a particular column qualifier across the rows ids
>> passed to batch scanner.How the summation can be done across the rows as
>> you said "you can put a second iterator "above" the first"?
>>
>> Thanks
>> Madhvi
>> On Wednesday 17 June 2015 08:43 PM, Josh Elser wrote:
>>> Madhvi,
>>>
>>> Understood. A few more questions..
>>>
>>> How are you passing these IDs to the batch scanner? Are you providing
>>> individual Ranges for each ID (e.g. `new Range(new Key("row1", "",
>>> "id1"), true, new Key("row1", "", "id1\x00"), false))`)? Or are you
>>> providing an entire row (or set of rows) and using the
>>> fetchColumns(Text,Text) method (or similar) on the BatchScanner?
>>>
>>> Are you trying to sum across all rows that you queried? Or is your sum
>>> per-row? If the former, that is going to cause you problems. The quick
>>> explanation is that you can't reliably know the tablet boundaries so
>>> you should try to perform an initial sum, per row. If you want, you
>>> can put a second iterator "above" the first and do a summation across
>>> all rows to reduce the amount of data sent to a client. However, if
>>> you use a BatchScanner, you will still have to perform a final
>>> summation at the client.
>>>
>>> Check out
>>> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo
>>>
>>> for more details on that..
>>>
>>> madhvi wrote:
>>>> Hi Josh,
>>>>
>>>> Sorry, my company policy doesn't allow me to share full source.What we
>>>> are tryng to do is summing over a unique field stored in column
>>>> qualifier for IDs passed to batch scanner.Can u suggest how it can be
>>>> done in accumulo.
>>>>
>>>> Thanks
>>>> Madhvi
>>>> On Wednesday 17 June 2015 10:32 AM, Josh Elser wrote:
>>>>> You put random values in the family and qualifier? Do I misunderstand
>>>>> you?
>>>>>
>>>>> Also, if you can put up the full source for the iterator, that will be
>>>>> much easier if you need help debugging it. It's hard for us to guess
>>>>> at why your code might not be working as you expect.
>>>>>
>>>>> madhvi wrote:
>>>>>> Hi Josh,
>>>>>>
>>>>>> I have changed HashMap to TreeMap which sorts lexicographically and I
>>>>>> have inserted random values in column family and qualifier.Value of
>>>>>> TreeMap in value.
>>>>>> Used scanner and batch scanner but getting results only with scanner.
>>>>>>
>>>>>> Thanks
>>>>>> Madhvi
>>>>>>
>>>>>> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>>>>>>> Additionally, you're placing the Value into the ColumnQualifier and
>>>>>>> dropping the ColumnFamily completely. Granted, that may not be a
>>>>>>> problem for the specific data in your table, but it's not going to
>>>>>>> work for any data.
>>>>>>>
>>>>>>> Christopher wrote:
>>>>>>>> You're iterating over a HashMap. That's not sorted.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Christopher L Tubbs II
>>>>>>>> http://gravatar.com/ctubbsii
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com>
>>>>>>>> wrote:
>>>>>>>>> Hi Josh,
>>>>>>>>> Thanks for replying. I will enable remote debugger on my Accumulo
>>>>>>>>> server.
>>>>>>>>>
>>>>>>>>> However I am slightly confused with your statement "you are not
>>>>>>>>> returning
>>>>>>>>> your data in sorted order". Can you point the part in my iterator
>>>>>>>>> code which
>>>>>>>>> seems innapropriate and any possible solution for that?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Madhvi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>>>>>>> //matched the condition and put values to holder map.
>>>>>>>>>
>>>>>>
>>>>
>>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by Josh Elser <jo...@gmail.com>.
Good, I'm glad you found it useful.

The important thing to always remember is that your data is split across 
many tablet servers and that Iterators run local to each tablet server. 
As such, you cannot compute a single sum via an iterator. You can, at 
best, compute N intermediate sums, one for each tablet server the 
BatchScanner had to talk to.

Also ignore my previous comment about a second iterator. I had assumed 
you were doing something fancier than selecting a single column 
qualifier from a row.

Since you're passing in what are likely multiple, disjoint ranges, I'm 
not sure you're going to get much of a performance optimization out of a 
custom iterator in this case. After each seek, your iterator would need 
to return the entries that it summed in the provided Range (the Iterator 
framework isn't designed to know the overall state of the scan -- you 
might have more data to read or you might be done. You must return the 
data when the data you're reading moves outside of the current range).

The way that you'd see the real optimization an Iterator provides is if 
you are scanning over a large, contiguous set of rows specified by a 
single Range (you can get the reduction of reading many key/values into 
a single pair returned).

If I mis-stated your situation, please do let me know.
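
To make the "N intermediate sums" point concrete, here is a rough sketch in plain Java with no Accumulo classes. The names partialSum and combine are made up for illustration: partialSum stands in for whatever a summing iterator would return from one tablet server, and combine is the final addition the client still has to do.

```java
import java.util.Arrays;
import java.util.List;

class PartialSumDemo {
    // Stand-in for server-side work: the sum of the values one tablet server saw locally.
    static long partialSum(List<Long> localValues) {
        long sum = 0;
        for (long v : localValues) {
            sum += v;
        }
        return sum;
    }

    // Client side: one partial sum arrives per server, in no particular order.
    // Addition is commutative, so arrival order does not matter.
    static long combine(List<Long> partialSums) {
        long total = 0;
        for (long p : partialSums) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        long fromServerA = partialSum(Arrays.asList(1L, 2L, 3L)); // 6
        long fromServerB = partialSum(Arrays.asList(10L));        // 10
        System.out.println(combine(Arrays.asList(fromServerB, fromServerA))); // 16
    }
}
```

The server-side step shrinks many values to one per server, but only the client sees all N results, so the last addition has to happen there.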

madhvi wrote:
> Hi,
>
> Thanks for the blog you shared.I found it quite useful for my requirement.
> "How are you passing these IDs to the batch scanner?"
> I am passing row ids received as a previous query result from another
> table as 'new Range(entry.getKey().getRow())' in a Range type list and
> passing that list to batch Scanner.
>
> "Are you trying to sum across all rows that you queried? "
> Yes we need to sum a particular column qualifier across the rows ids
> passed to batch scanner.How the summation can be done across the rows as
> you said "you can put a second iterator "above" the first"?
>
> Thanks
> Madhvi
> On Wednesday 17 June 2015 08:43 PM, Josh Elser wrote:
>> Madhvi,
>>
>> Understood. A few more questions..
>>
>> How are you passing these IDs to the batch scanner? Are you providing
>> individual Ranges for each ID (e.g. `new Range(new Key("row1", "",
>> "id1"), true, new Key("row1", "", "id1\x00"), false))`)? Or are you
>> providing an entire row (or set of rows) and using the
>> fetchColumns(Text,Text) method (or similar) on the BatchScanner?
>>
>> Are you trying to sum across all rows that you queried? Or is your sum
>> per-row? If the former, that is going to cause you problems. The quick
>> explanation is that you can't reliably know the tablet boundaries so
>> you should try to perform an initial sum, per row. If you want, you
>> can put a second iterator "above" the first and do a summation across
>> all rows to reduce the amount of data sent to a client. However, if
>> you use a BatchScanner, you will still have to perform a final
>> summation at the client.
>>
>> Check out
>> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo
>> for more details on that..
>>
>> madhvi wrote:
>>> Hi Josh,
>>>
>>> Sorry, my company policy doesn't allow me to share full source.What we
>>> are tryng to do is summing over a unique field stored in column
>>> qualifier for IDs passed to batch scanner.Can u suggest how it can be
>>> done in accumulo.
>>>
>>> Thanks
>>> Madhvi
>>> On Wednesday 17 June 2015 10:32 AM, Josh Elser wrote:
>>>> You put random values in the family and qualifier? Do I misunderstand
>>>> you?
>>>>
>>>> Also, if you can put up the full source for the iterator, that will be
>>>> much easier if you need help debugging it. It's hard for us to guess
>>>> at why your code might not be working as you expect.
>>>>
>>>> madhvi wrote:
>>>>> Hi Josh,
>>>>>
>>>>> I have changed HashMap to TreeMap which sorts lexicographically and I
>>>>> have inserted random values in column family and qualifier.Value of
>>>>> TreeMap in value.
>>>>> Used scanner and batch scanner but getting results only with scanner.
>>>>>
>>>>> Thanks
>>>>> Madhvi
>>>>>
>>>>> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>>>>>> Additionally, you're placing the Value into the ColumnQualifier and
>>>>>> dropping the ColumnFamily completely. Granted, that may not be a
>>>>>> problem for the specific data in your table, but it's not going to
>>>>>> work for any data.
>>>>>>
>>>>>> Christopher wrote:
>>>>>>> You're iterating over a HashMap. That's not sorted.
>>>>>>>
>>>>>>> --
>>>>>>> Christopher L Tubbs II
>>>>>>> http://gravatar.com/ctubbsii
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com>
>>>>>>> wrote:
>>>>>>>> Hi Josh,
>>>>>>>> Thanks for replying. I will enable remote debugger on my Accumulo
>>>>>>>> server.
>>>>>>>>
>>>>>>>> However I am slightly confused with your statement "you are not
>>>>>>>> returning
>>>>>>>> your data in sorted order". Can you point the part in my iterator
>>>>>>>> code which
>>>>>>>> seems innapropriate and any possible solution for that?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Madhvi
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>>>>>> //matched the condition and put values to holder map.
>>>>>>>>
>>>>>
>>>
>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by madhvi <ma...@orkash.com>.
Hi,

Thanks for the blog you shared. I found it quite useful for my requirement.
"How are you passing these IDs to the batch scanner?"
I am passing row IDs received as a previous query result from another 
table as 'new Range(entry.getKey().getRow())' in a list of Ranges and 
passing that list to the batch scanner.

"Are you trying to sum across all rows that you queried? "
Yes, we need to sum a particular column qualifier across the row IDs 
passed to the batch scanner. How can the summation be done across the rows, 
as you said "you can put a second iterator "above" the first"?

Thanks
Madhvi
On Wednesday 17 June 2015 08:43 PM, Josh Elser wrote:
> Madhvi,
>
> Understood. A few more questions..
>
> How are you passing these IDs to the batch scanner? Are you providing 
> individual Ranges for each ID (e.g. `new Range(new Key("row1", "", 
> "id1"), true, new Key("row1", "", "id1\x00"), false))`)? Or are you 
> providing an entire row (or set of rows) and using the 
> fetchColumns(Text,Text) method (or similar) on the BatchScanner?
>
> Are you trying to sum across all rows that you queried? Or is your sum 
> per-row? If the former, that is going to cause you problems. The quick 
> explanation is that you can't reliably know the tablet boundaries so 
> you should try to perform an initial sum, per row. If you want, you 
> can put a second iterator "above" the first and do a summation across 
> all rows to reduce the amount of data sent to a client. However, if 
> you use a BatchScanner, you will still have to perform a final 
> summation at the client.
>
> Check out 
> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo 
> for more details on that..
>
> madhvi wrote:
>> Hi Josh,
>>
>> Sorry, my company policy doesn't allow me to share full source.What we
>> are tryng to do is summing over a unique field stored in column
>> qualifier for IDs passed to batch scanner.Can u suggest how it can be
>> done in accumulo.
>>
>> Thanks
>> Madhvi
>> On Wednesday 17 June 2015 10:32 AM, Josh Elser wrote:
>>> You put random values in the family and qualifier? Do I misunderstand
>>> you?
>>>
>>> Also, if you can put up the full source for the iterator, that will be
>>> much easier if you need help debugging it. It's hard for us to guess
>>> at why your code might not be working as you expect.
>>>
>>> madhvi wrote:
>>>> Hi Josh,
>>>>
>>>> I have changed HashMap to TreeMap which sorts lexicographically and I
>>>> have inserted random values in column family and qualifier.Value of
>>>> TreeMap in value.
>>>> Used scanner and batch scanner but getting results only with scanner.
>>>>
>>>> Thanks
>>>> Madhvi
>>>>
>>>> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>>>>> Additionally, you're placing the Value into the ColumnQualifier and
>>>>> dropping the ColumnFamily completely. Granted, that may not be a
>>>>> problem for the specific data in your table, but it's not going to
>>>>> work for any data.
>>>>>
>>>>> Christopher wrote:
>>>>>> You're iterating over a HashMap. That's not sorted.
>>>>>>
>>>>>> -- 
>>>>>> Christopher L Tubbs II
>>>>>> http://gravatar.com/ctubbsii
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com>
>>>>>> wrote:
>>>>>>> Hi Josh,
>>>>>>> Thanks for replying. I will enable remote debugger on my Accumulo
>>>>>>> server.
>>>>>>>
>>>>>>> However I am slightly confused with your statement "you are not
>>>>>>> returning
>>>>>>> your data in sorted order". Can you point the part in my iterator
>>>>>>> code which
>>>>>>> seems innapropriate and any possible solution for that?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Madhvi
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>>>>> //matched the condition and put values to holder map.
>>>>>>>
>>>>
>>


Re: Abnormal behaviour of custom iterator in getting entries

Posted by Dylan Hutchison <dh...@mit.edu>.
Chiming in on one of Josh's comments

> Since you're passing in what are likely multiple, disjoint ranges, I'm not
> sure you're going to get much of a performance optimization out of a custom
> iterator in this case. After each seek, your iterator would need to return
> the entries that it summed in the provided Range (the Iterator framework
> isn't designed to know the overall state of the scan -- you might have more
> data to read or you might be done. You must return the data when the data
> you're reading moves outside of the current range).
>
> The way that you'd see the real optimization an Iterator provides is if
> you are scanning over a large, contiguous set of rows specified by a single
> Range (you can get the reduction of reading many key/values into a single
> pair returned).


FYI, it is possible to obtain better custom iterator performance in the
case of scanning with multiple, disjoint ranges.  The trick is to call
BatchScanner's setRanges() with an infinite range, causing Accumulo to run
your iterator on every tablet.  Then, pass your desired ranges to the
iterator directly via iterator options, and let the iterator control
seeking itself.  This is kind of advanced and needs more detailed study,
but you can see a prototype of how I do it in the Graphulo
<https://github.com/Accla/d4m_api_java> library:

https://github.com/Accla/d4m_api_java/blob/master/src/main/java/edu/mit/ll/graphulo/skvi/RemoteSourceIterator.java#L264

or

https://github.com/Accla/d4m_api_java/blob/master/src/main/java/edu/mit/ll/graphulo/skvi/RemoteWriteIterator.java#L360


Cheers, Dylan
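
A very rough sketch of the "pass the ranges through iterator options" idea, in plain Java and without any Accumulo classes. Iterator options are string-to-string pairs, so the wanted rows have to be serialized somehow. Everything here is an assumption made for illustration: the option name targetRows, the comma-separated encoding, and the TreeMap that stands in for one tablet's sorted data. It is not how Graphulo does it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RangesInOptionsDemo {
    static final String ROWS_OPTION = "targetRows"; // hypothetical option name

    // Client side: pack the wanted row IDs into a string-to-string option map.
    // Assumes row IDs contain no commas; a real encoding would need escaping.
    static Map<String, String> encode(List<String> rows) {
        Map<String, String> options = new HashMap<>();
        options.put(ROWS_OPTION, String.join(",", rows));
        return options;
    }

    // Server-side stand-in: read the option back, then jump to each wanted row.
    // The TreeMap lookup plays the role of the iterator seeking its own source.
    static long sumRows(TreeMap<String, Long> tabletData, Map<String, String> options) {
        long sum = 0;
        for (String row : options.get(ROWS_OPTION).split(",")) {
            Long value = tabletData.get(row);
            if (value != null) { // this tablet may hold only some of the rows
                sum += value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        TreeMap<String, Long> tablet = new TreeMap<>();
        tablet.put("row1", 5L);
        tablet.put("row2", 7L);
        tablet.put("row3", 11L);
        Map<String, String> options = encode(Arrays.asList("row1", "row3", "row9"));
        System.out.println(sumRows(tablet, options)); // 16
    }
}
```

Each tablet would still produce only its own partial result, so the client-side combination step remains necessary.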

On Tue, Jun 23, 2015 at 6:53 AM, madhvi <ma...@orkash.com> wrote:

> Thanks Josh. It really worked for me.
>
>
> On Wednesday 17 June 2015 08:43 PM, Josh Elser wrote:
>
>> Madhvi,
>>
>> Understood. A few more questions..
>>
>> How are you passing these IDs to the batch scanner? Are you providing
>> individual Ranges for each ID (e.g. `new Range(new Key("row1", "", "id1"),
>> true, new Key("row1", "", "id1\x00"), false))`)? Or are you providing an
>> entire row (or set of rows) and using the fetchColumns(Text,Text) method
>> (or similar) on the BatchScanner?
>>
>> Are you trying to sum across all rows that you queried? Or is your sum
>> per-row? If the former, that is going to cause you problems. The quick
>> explanation is that you can't reliably know the tablet boundaries so you
>> should try to perform an initial sum, per row. If you want, you can put a
>> second iterator "above" the first and do a summation across all rows to
>> reduce the amount of data sent to a client. However, if you use a
>> BatchScanner, you will still have to perform a final summation at the
>> client.
>>
>> Check out
>> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo
>> for more details on that..
>>
>> madhvi wrote:
>>
>>> Hi Josh,
>>>
>>> Sorry, my company policy doesn't allow me to share full source.What we
>>> are tryng to do is summing over a unique field stored in column
>>> qualifier for IDs passed to batch scanner.Can u suggest how it can be
>>> done in accumulo.
>>>
>>> Thanks
>>> Madhvi
>>> On Wednesday 17 June 2015 10:32 AM, Josh Elser wrote:
>>>
>>>> You put random values in the family and qualifier? Do I misunderstand
>>>> you?
>>>>
>>>> Also, if you can put up the full source for the iterator, that will be
>>>> much easier if you need help debugging it. It's hard for us to guess
>>>> at why your code might not be working as you expect.
>>>>
>>>> madhvi wrote:
>>>>
>>>>> Hi Josh,
>>>>>
>>>>> I have changed HashMap to TreeMap which sorts lexicographically and I
>>>>> have inserted random values in column family and qualifier.Value of
>>>>> TreeMap in value.
>>>>> Used scanner and batch scanner but getting results only with scanner.
>>>>>
>>>>> Thanks
>>>>> Madhvi
>>>>>
>>>>> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>>>>>
>>>>>> Additionally, you're placing the Value into the ColumnQualifier and
>>>>>> dropping the ColumnFamily completely. Granted, that may not be a
>>>>>> problem for the specific data in your table, but it's not going to
>>>>>> work for any data.
>>>>>>
>>>>>> Christopher wrote:
>>>>>>
>>>>>>> You're iterating over a HashMap. That's not sorted.
>>>>>>>
>>>>>>> --
>>>>>>> Christopher L Tubbs II
>>>>>>> http://gravatar.com/ctubbsii
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Josh,
>>>>>>>> Thanks for replying. I will enable remote debugger on my Accumulo
>>>>>>>> server.
>>>>>>>>
>>>>>>>> However I am slightly confused with your statement "you are not
>>>>>>>> returning
>>>>>>>> your data in sorted order". Can you point the part in my iterator
>>>>>>>> code which
>>>>>>>> seems innapropriate and any possible solution for that?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Madhvi
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>>>>>
>>>>>>>>> //matched the condition and put values to holder map.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>
>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by madhvi <ma...@orkash.com>.
Thanks Josh. It really worked for me.


On Wednesday 17 June 2015 08:43 PM, Josh Elser wrote:
> Madhvi,
>
> Understood. A few more questions..
>
> How are you passing these IDs to the batch scanner? Are you providing 
> individual Ranges for each ID (e.g. `new Range(new Key("row1", "", 
> "id1"), true, new Key("row1", "", "id1\x00"), false))`)? Or are you 
> providing an entire row (or set of rows) and using the 
> fetchColumns(Text,Text) method (or similar) on the BatchScanner?
>
> Are you trying to sum across all rows that you queried? Or is your sum 
> per-row? If the former, that is going to cause you problems. The quick 
> explanation is that you can't reliably know the tablet boundaries so 
> you should try to perform an initial sum, per row. If you want, you 
> can put a second iterator "above" the first and do a summation across 
> all rows to reduce the amount of data sent to a client. However, if 
> you use a BatchScanner, you will still have to perform a final 
> summation at the client.
>
> Check out 
> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo 
> for more details on that..
>
> madhvi wrote:
>> Hi Josh,
>>
>> Sorry, my company policy doesn't allow me to share full source.What we
>> are tryng to do is summing over a unique field stored in column
>> qualifier for IDs passed to batch scanner.Can u suggest how it can be
>> done in accumulo.
>>
>> Thanks
>> Madhvi
>> On Wednesday 17 June 2015 10:32 AM, Josh Elser wrote:
>>> You put random values in the family and qualifier? Do I misunderstand
>>> you?
>>>
>>> Also, if you can put up the full source for the iterator, that will be
>>> much easier if you need help debugging it. It's hard for us to guess
>>> at why your code might not be working as you expect.
>>>
>>> madhvi wrote:
>>>> Hi Josh,
>>>>
>>>> I have changed HashMap to TreeMap which sorts lexicographically and I
>>>> have inserted random values in column family and qualifier.Value of
>>>> TreeMap in value.
>>>> Used scanner and batch scanner but getting results only with scanner.
>>>>
>>>> Thanks
>>>> Madhvi
>>>>
>>>> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>>>>> Additionally, you're placing the Value into the ColumnQualifier and
>>>>> dropping the ColumnFamily completely. Granted, that may not be a
>>>>> problem for the specific data in your table, but it's not going to
>>>>> work for any data.
>>>>>
>>>>> Christopher wrote:
>>>>>> You're iterating over a HashMap. That's not sorted.
>>>>>>
>>>>>> -- 
>>>>>> Christopher L Tubbs II
>>>>>> http://gravatar.com/ctubbsii
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com>
>>>>>> wrote:
>>>>>>> Hi Josh,
>>>>>>> Thanks for replying. I will enable remote debugger on my Accumulo
>>>>>>> server.
>>>>>>>
>>>>>>> However I am slightly confused with your statement "you are not
>>>>>>> returning
>>>>>>> your data in sorted order". Can you point the part in my iterator
>>>>>>> code which
>>>>>>> seems innapropriate and any possible solution for that?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Madhvi
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>>>>> //matched the condition and put values to holder map.
>>>>>>>
>>>>
>>


Re: Abnormal behaviour of custom iterator in getting entries

Posted by Josh Elser <jo...@gmail.com>.
Madhvi,

Understood. A few more questions.

How are you passing these IDs to the batch scanner? Are you providing 
individual Ranges for each ID (e.g. `new Range(new Key("row1", "", 
"id1"), true, new Key("row1", "", "id1\0"), false)`)? Or are you 
providing an entire row (or set of rows) and using the 
fetchColumns(Text,Text) method (or similar) on the BatchScanner?

Are you trying to sum across all rows that you queried? Or is your sum 
per-row? If the former, that is going to cause you problems. The quick 
explanation is that you can't reliably know the tablet boundaries so you 
should try to perform an initial sum, per row. If you want, you can put 
a second iterator "above" the first and do a summation across all rows 
to reduce the amount of data sent to a client. However, if you use a 
BatchScanner, you will still have to perform a final summation at the 
client.

Check out 
https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo 
for more details on that.
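
As a rough illustration of the "initial sum, per row" idea, the following plain-Java sketch walks entries that are already sorted by row and emits one total each time the row changes. The class and method names are hypothetical, and the String[] pairs merely stand in for key/value entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PerRowSumDemo {
    // Input: {row, value} pairs already sorted by row.
    // Output: one "row=sum" string per distinct row, in the same order.
    static List<String> sumPerRow(List<String[]> sortedEntries) {
        List<String> out = new ArrayList<>();
        String currentRow = null;
        long sum = 0;
        for (String[] entry : sortedEntries) {
            if (currentRow != null && !currentRow.equals(entry[0])) {
                out.add(currentRow + "=" + sum); // row changed: emit the finished total
                sum = 0;
            }
            currentRow = entry[0];
            sum += Long.parseLong(entry[1]);
        }
        if (currentRow != null) {
            out.add(currentRow + "=" + sum); // emit the last row
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(
            new String[] {"row1", "1"},
            new String[] {"row1", "2"},
            new String[] {"row2", "5"});
        System.out.println(sumPerRow(entries)); // [row1=3, row2=5]
    }
}
```

Because each total is closed out as soon as the row changes, the sketch never needs to know where the data as a whole ends.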

madhvi wrote:
> Hi Josh,
>
> Sorry, my company policy doesn't allow me to share full source.What we
> are tryng to do is summing over a unique field stored in column
> qualifier for IDs passed to batch scanner.Can u suggest how it can be
> done in accumulo.
>
> Thanks
> Madhvi
> On Wednesday 17 June 2015 10:32 AM, Josh Elser wrote:
>> You put random values in the family and qualifier? Do I misunderstand
>> you?
>>
>> Also, if you can put up the full source for the iterator, that will be
>> much easier if you need help debugging it. It's hard for us to guess
>> at why your code might not be working as you expect.
>>
>> madhvi wrote:
>>> Hi Josh,
>>>
>>> I have changed HashMap to TreeMap which sorts lexicographically and I
>>> have inserted random values in column family and qualifier.Value of
>>> TreeMap in value.
>>> Used scanner and batch scanner but getting results only with scanner.
>>>
>>> Thanks
>>> Madhvi
>>>
>>> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>>>> Additionally, you're placing the Value into the ColumnQualifier and
>>>> dropping the ColumnFamily completely. Granted, that may not be a
>>>> problem for the specific data in your table, but it's not going to
>>>> work for any data.
>>>>
>>>> Christopher wrote:
>>>>> You're iterating over a HashMap. That's not sorted.
>>>>>
>>>>> --
>>>>> Christopher L Tubbs II
>>>>> http://gravatar.com/ctubbsii
>>>>>
>>>>>
>>>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com>
>>>>> wrote:
>>>>>> Hi Josh,
>>>>>> Thanks for replying. I will enable remote debugger on my Accumulo
>>>>>> server.
>>>>>>
>>>>>> However I am slightly confused with your statement "you are not
>>>>>> returning
>>>>>> your data in sorted order". Can you point the part in my iterator
>>>>>> code which
>>>>>> seems innapropriate and any possible solution for that?
>>>>>>
>>>>>> Thanks
>>>>>> Madhvi
>>>>>>
>>>>>>
>>>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>>>> //matched the condition and put values to holder map.
>>>>>>
>>>
>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by madhvi <ma...@orkash.com>.
Hi Josh,

Sorry, my company policy doesn't allow me to share the full source. What we 
are trying to do is sum over a unique field stored in the column 
qualifier for the IDs passed to the batch scanner. Can you suggest how this 
can be done in Accumulo?

Thanks
Madhvi
On Wednesday 17 June 2015 10:32 AM, Josh Elser wrote:
> You put random values in the family and qualifier? Do I misunderstand 
> you?
>
> Also, if you can put up the full source for the iterator, that will be 
> much easier if you need help debugging it. It's hard for us to guess 
> at why your code might not be working as you expect.
>
> madhvi wrote:
>> Hi Josh,
>>
>> I have changed HashMap to TreeMap which sorts lexicographically and I
>> have inserted random values in column family and qualifier.Value of
>> TreeMap in value.
>> Used scanner and batch scanner but getting results only with scanner.
>>
>> Thanks
>> Madhvi
>>
>> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>>> Additionally, you're placing the Value into the ColumnQualifier and
>>> dropping the ColumnFamily completely. Granted, that may not be a
>>> problem for the specific data in your table, but it's not going to
>>> work for any data.
>>>
>>> Christopher wrote:
>>>> You're iterating over a HashMap. That's not sorted.
>>>>
>>>> -- 
>>>> Christopher L Tubbs II
>>>> http://gravatar.com/ctubbsii
>>>>
>>>>
>>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com> 
>>>> wrote:
>>>>> Hi Josh,
>>>>> Thanks for replying. I will enable remote debugger on my Accumulo
>>>>> server.
>>>>>
>>>>> However I am slightly confused with your statement "you are not
>>>>> returning
>>>>> your data in sorted order". Can you point the part in my iterator
>>>>> code which
>>>>> seems innapropriate and any possible solution for that?
>>>>>
>>>>> Thanks
>>>>> Madhvi
>>>>>
>>>>>
>>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>>> //matched the condition and put values to holder map.
>>>>>
>>


Re: Abnormal behaviour of custom iterator in getting entries

Posted by Josh Elser <jo...@gmail.com>.
You put random values in the family and qualifier? Do I misunderstand you?

Also, if you can put up the full source for the iterator, that will be 
much easier if you need help debugging it. It's hard for us to guess at 
why your code might not be working as you expect.

madhvi wrote:
> Hi Josh,
>
> I have changed HashMap to TreeMap which sorts lexicographically and I
> have inserted random values in column family and qualifier.Value of
> TreeMap in value.
> Used scanner and batch scanner but getting results only with scanner.
>
> Thanks
> Madhvi
>
> On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
>> Additionally, you're placing the Value into the ColumnQualifier and
>> dropping the ColumnFamily completely. Granted, that may not be a
>> problem for the specific data in your table, but it's not going to
>> work for any data.
>>
>> Christopher wrote:
>>> You're iterating over a HashMap. That's not sorted.
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com> wrote:
>>>> Hi Josh,
>>>> Thanks for replying. I will enable remote debugger on my Accumulo
>>>> server.
>>>>
>>>> However I am slightly confused with your statement "you are not
>>>> returning
>>>> your data in sorted order". Can you point the part in my iterator
>>>> code which
>>>> seems innapropriate and any possible solution for that?
>>>>
>>>> Thanks
>>>> Madhvi
>>>>
>>>>
>>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>>> //matched the condition and put values to holder map.
>>>>
>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by madhvi <ma...@orkash.com>.
Hi Josh,

I have changed the HashMap to a TreeMap, which sorts lexicographically, and I 
have inserted random values in the column family and qualifier. The value 
from the TreeMap goes in the value.
I used both a scanner and a batch scanner, but I am getting results only 
with the scanner.

Thanks
Madhvi

On Tuesday 16 June 2015 08:42 PM, Josh Elser wrote:
> Additionally, you're placing the Value into the ColumnQualifier and 
> dropping the ColumnFamily completely. Granted, that may not be a 
> problem for the specific data in your table, but it's not going to 
> work for any data.
>
> Christopher wrote:
>> You're iterating over a HashMap. That's not sorted.
>>
>> -- 
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com>  wrote:
>>> Hi Josh,
>>> Thanks for replying. I will enable remote debugger on my Accumulo 
>>> server.
>>>
>>> However I am slightly confused with your statement "you are not 
>>> returning
>>> your data in sorted order". Can you point the part in my iterator 
>>> code which
>>> seems innapropriate and any possible solution for that?
>>>
>>> Thanks
>>> Madhvi
>>>
>>>
>>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>>> //matched the condition and put values to holder map.
>>>


Re: Abnormal behaviour of custom iterator in getting entries

Posted by Josh Elser <jo...@gmail.com>.
Additionally, you're placing the Value into the ColumnQualifier and 
dropping the ColumnFamily completely. Granted, that may not be a problem 
for the specific data in your table, but it's not going to work for 
arbitrary data.

Christopher wrote:
> You're iterating over a HashMap. That's not sorted.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Tue, Jun 16, 2015 at 1:58 AM, madhvi<ma...@orkash.com>  wrote:
>> Hi Josh,
>> Thanks for replying. I will enable remote debugger on my Accumulo server.
>>
>> However I am slightly confused with your statement "you are not returning
>> your data in sorted order". Can you point the part in my iterator code which
>> seems inappropriate and any possible solution for that?
>>
>> Thanks
>> Madhvi
>>
>>
>> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>> //matched the condition and put values to holder map.
>>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by Christopher <ct...@apache.org>.
You're iterating over a HashMap. That's not sorted.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Tue, Jun 16, 2015 at 1:58 AM, madhvi <ma...@orkash.com> wrote:
> Hi Josh,
> Thanks for replying. I will enable remote debugger on my Accumulo server.
>
> However I am slightly confused with your statement "you are not returning
> your data in sorted order". Can you point the part in my iterator code which
> seems inappropriate and any possible solution for that?
>
> Thanks
> Madhvi
>
>
> On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
>>
>> //matched the condition and put values to holder map.
>
>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by madhvi <ma...@orkash.com>.
Hi Josh,
Thanks for replying. I will enable remote debugger on my Accumulo server.

However, I am slightly confused by your statement "you are not 
returning your data in sorted order". Can you point out the part of my 
iterator code that seems inappropriate and suggest a possible solution?

Thanks
Madhvi

On Tuesday 16 June 2015 11:07 AM, Josh Elser wrote:
> //matched the condition and put values to holder map. 


Re: Abnormal behaviour of custom iterator in getting entries

Posted by Josh Elser <jo...@gmail.com>.
To enable remote debugging, in ACCUMULO_TSERVER_OPTS in accumulo-env.sh, 
add the following "-Xdebug 
-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8888"

In this case, you would then use the port 8888 in Eclipse to do a Remote 
Java Application debugging session. Your TServer would need to be 
running locally to do this. If it's running on a remote host, you could 
do some trickery setting up SSH tunnels.

--

One problem with your iterator is that you are not returning your data 
in sorted order. This is a very bad idea as it invalidates the contract 
of the SortedKeyValueIterator interface and will cause you trouble in 
the future.
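
As a rough illustration of the ordering point above, here is a minimal 
plain-Java sketch. It uses no Accumulo classes; SortedEmitSketch, 
emitOrder and isStrictlySorted are hypothetical names made up for this 
example. It assumes the emitted row ids are plain ASCII strings, for 
which Java's String ordering agrees with byte-wise ordering. A 
HashMap-backed holder gives no ordering guarantee, while copying it 
into a TreeMap yields ascending keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; no Accumulo classes are involved.
class SortedEmitSketch {

    // Returns the keys in the order a TreeMap-backed holder would emit them.
    static List<String> emitOrder(Map<String, Integer> counts) {
        // TreeMap iterates in ascending key order (natural String ordering).
        return new ArrayList<>(new TreeMap<>(counts).keySet());
    }

    // True if every key is strictly greater than the one before it.
    static boolean isStrictlySorted(List<String> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1).compareTo(keys.get(i)) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

A check like isStrictlySorted() can be handy in a unit test for an 
iterator, since out-of-order keys are easy to miss by eye.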

I'm not certain if this is why you are having problems with the 
BatchScanner -- I would have thought this would be problematic in both 
the Scanner and BatchScanner. You may have just found a set of 
conditions under which the Scanner happened not to fail when it should 
have.

The omitted code in your myFunction() is a little scary too. You do not 
want to consume all of the data in the Range at one time as you will 
cause the server to run out of memory. SKVIs are meant to be run over 
data in your table _without_ keeping all of the data in memory. Think 
more of iterators as functions being applied to a stream of Keys and Values.

You can buffer small amounts of data in an iterator in memory (for 
example, buffering a row is fairly common), however this also requires 
sufficient memory on the tablet server to keep any row in memory. e.g. 
if you have a row that has 100k key-values in it, you will run out of 
memory.
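
To make the "function applied to a stream" idea concrete, here is a 
minimal plain-Java sketch, not an Accumulo iterator. StreamingSumSketch 
is a hypothetical name, and the sketch assumes the source is already 
sorted so that entries sharing a key are adjacent. It sums each run of 
equal keys while holding only one running total and one look-ahead 
entry, instead of loading the whole range into a map first.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Sums consecutive entries that share a key, one group at a time.
class StreamingSumSketch implements Iterator<Map.Entry<String, Long>> {
    private final Iterator<Map.Entry<String, Long>> source;
    // First entry of the next group, or null when the source is drained.
    private Map.Entry<String, Long> pending;

    StreamingSumSketch(Iterator<Map.Entry<String, Long>> sortedSource) {
        this.source = sortedSource;
        this.pending = sortedSource.hasNext() ? sortedSource.next() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public Map.Entry<String, Long> next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        String key = pending.getKey();
        long sum = pending.getValue();
        pending = null;
        // Consume entries until the key changes; keep the first
        // entry of the following group as the new look-ahead.
        while (source.hasNext()) {
            Map.Entry<String, Long> e = source.next();
            if (!e.getKey().equals(key)) {
                pending = e;
                break;
            }
            sum += e.getValue();
        }
        return new SimpleEntry<String, Long>(key, sum);
    }
}
```

Memory use here is independent of how many entries the range holds, 
which is the property the paragraphs above are asking for.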

madhvi wrote:
> Thanks Josh.
>
> Outline of my code is:
>
> public class TestIterator extends WrappingIterator {
>
> HashMap<String, Integer> holder = new HashMap<>();
> private Iterator<Map.Entry<String, Integer>> entries=null;
> private Entry<String, Integer> entry=null;
> private Key emitKey;
> private Value emitValue;
>
> @Override
> public void seek(Range range, Collection<ByteSequence> columnFamilies,
> boolean inclusive) throws IOException {
> super.seek(range, columnFamilies, inclusive);
> myFunction();
> }
>
> myFunction()
> {
> while(super.hasTop())
> {
> //matched the condition and put values to holder map.
> }
> entries = holder.entrySet().iterator();//iterate the map holder.
> }
>
> @Override
> public Key getTopKey() {
> return emitKey;
> }
>
> @Override
> public Value getTopValue() {
> return emitValue;
> }
>
> @Override
> public boolean hasTop() {
> return entries.hasNext();
> }
>
> @Override
> public void next() throws IOException {
> try{
> entry = entries.next();
> //put the keys of map to rowid and values of map to columnqualifier
> through emitKey
> emitKey = new Key(new Text(entry.getKey()), new Text(), new
> Text(String.valueOf(entry.getValue())));
> //return 1 in emitValue.
> emitValue = new Value("1".getBytes());
> }
> catch(Exception e)
> {
> e.printStackTrace();
> }
> }
> }
>
> This code returns results when using a Scanner but not when using a
> BatchScanner.
> Also, how do I enable the remote debugger in Accumulo?
>
> Thanks
> Madhvi
>
> On Monday 15 June 2015 09:21 PM, Josh Elser wrote:
>> It's hard to remotely debug an iterator, especially when we don't know
>> what it's doing. If you can post the code, that would help
>> tremendously. Instead of dumping values to a text file, you may fare
>> better by attaching a remote debugger to the TabletServer and setting
>> a breakpoint on your SKVI.
>>
>> The only thing I can say is that a Scanner and BatchScanner should
>> return the same data, but the invocations in the server to fetch that
>> data are performed differently. It's likely that due to the
>> differences in the implementations, you uncovered a bug in your iterator.
>>
>> One common pitfall is incorrectly handling something we refer to as a
>> "re-seek". Hypothetically, take a query scanning over [0, 9], and we
>> have one key per number in the range (10 keys).
>>
>> As the name implies, the BatchScanner fetches batches from a server,
>> and suppose that after 3 keys, the server-side buffer fills up. Thus,
>> the client will get keys [0,2]. In the server, the next time you fetch
>> a batch, a new instance of the iterator will be constructed (via
>> deepCopy()). Seek() will then be called, but with a new range that
>> represents the previous data that was already returned. Thus, your
>> iterator would be seeked with (2,9] instead of [0,9] again.
>>
>> I can't say whether or not you're actually hitting this case, but it's
>> a common pitfall that affects devs.
>>
>> madhvi wrote:
>>> @josh
>>> If after hasTop and getTopKey, seek would have called then this should
>>> also be written in call hierarchy.
>>> Because i have written all the function hierarchy in a file.
>>> so the problem if i have called myFunction() in seek.
>>> And after seek getTopKey and getTopValue then hasTop and next should be
>>> called but what is happening sometime getTopValue is called sometime
>>> not. This is happening when i am reading entries through batchscanner.
>>> getTopValue function is called while scanning through scanner, Applying
>>> same iterator using scanner and batchscanner, through scanner getting
>>> returned entries but getting no entries returned while using
>>> batchscanner.
>>>
>>> So can you please explain.
>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by madhvi <ma...@orkash.com>.
Thanks Josh.

Outline of my code is:

public class TestIterator extends WrappingIterator {

    HashMap<String, Integer> holder = new HashMap<>();
    private Iterator<Map.Entry<String, Integer>> entries = null;
    private Entry<String, Integer> entry = null;
    private Key emitKey;
    private Value emitValue;

    @Override
    public void seek(Range range, Collection<ByteSequence> columnFamilies,
            boolean inclusive) throws IOException {
        super.seek(range, columnFamilies, inclusive);
        myFunction();
    }

    myFunction()
    {
        while (super.hasTop())
        {
            //matched the condition and put values to holder map.
        }
        entries = holder.entrySet().iterator(); //iterate the map holder.
    }

    @Override
    public Key getTopKey() {
        return emitKey;
    }

    @Override
    public Value getTopValue() {
        return emitValue;
    }

    @Override
    public boolean hasTop() {
        return entries.hasNext();
    }

    @Override
    public void next() throws IOException {
        try {
            entry = entries.next();
            //put the keys of map to rowid and values of map to
            //columnqualifier through emitKey
            emitKey = new Key(new Text(entry.getKey()), new Text(),
                    new Text(String.valueOf(entry.getValue())));
            //return 1 in emitValue.
            emitValue = new Value("1".getBytes());
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

This code returns results when using a Scanner but not when using a 
BatchScanner.
Also, how do I enable the remote debugger in Accumulo?

Thanks
Madhvi

On Monday 15 June 2015 09:21 PM, Josh Elser wrote:
> It's hard to remotely debug an iterator, especially when we don't know 
> what it's doing. If you can post the code, that would help 
> tremendously. Instead of dumping values to a text file, you may fare 
> better by attaching a remote debugger to the TabletServer and setting 
> a breakpoint on your SKVI.
>
> The only thing I can say is that a Scanner and BatchScanner should 
> return the same data, but the invocations in the server to fetch that 
> data are performed differently. It's likely that due to the 
> differences in the implementations, you uncovered a bug in your iterator.
>
> One common pitfall is incorrectly handling something we refer to as a 
> "re-seek". Hypothetically, take a query scanning over [0, 9], and we 
> have one key per number in the range (10 keys).
>
> As the name implies, the BatchScanner fetches batches from a server, 
> and suppose that after 3 keys, the server-side buffer fills up. Thus, 
> the client will get keys [0,2]. In the server, the next time you fetch 
> a batch, a new instance of the iterator will be constructed (via 
> deepCopy()). Seek() will then be called, but with a new range that 
> represents the previous data that was already returned. Thus, your 
> iterator would be seeked with (2,9] instead of [0,9] again.
>
> I can't say whether or not you're actually hitting this case, but it's 
> a common pitfall that affects devs.
>
> madhvi wrote:
>> @josh
>> If after hasTop and getTopKey, seek would have called then this should
>> also be written in call hierarchy.
>> Because i have written all the function hierarchy in a file.
>> so the problem if i have called myFunction() in seek.
>> And after seek getTopKey and getTopValue then hasTop and next should be
>> called but what is happening sometime getTopValue is called sometime
>> not. This is happening when i am reading entries through batchscanner.
>> getTopValue function is called while scanning through scanner, Applying
>> same iterator using scanner and batchscanner, through scanner getting
>> returned entries but getting no entries returned while using 
>> batchscanner.
>>
>> So can you please explain.


Re: Abnormal behaviour of custom iterator in getting entries

Posted by Josh Elser <jo...@gmail.com>.
It's hard to remotely debug an iterator, especially when we don't know 
what it's doing. If you can post the code, that would help tremendously. 
Instead of dumping values to a text file, you may fare better by 
attaching a remote debugger to the TabletServer and setting a breakpoint 
on your SKVI.

The only thing I can say is that a Scanner and BatchScanner should 
return the same data, but the invocations in the server to fetch that 
data are performed differently. It's likely that due to the differences 
in the implementations, you uncovered a bug in your iterator.

One common pitfall is incorrectly handling something we refer to as a 
"re-seek". Hypothetically, take a query scanning over [0, 9], and we 
have one key per number in the range (10 keys).

As the name implies, the BatchScanner fetches batches from a server, and 
suppose that after 3 keys, the server-side buffer fills up. Thus, the 
client will get keys [0,2]. In the server, the next time you fetch a 
batch, a new instance of the iterator will be constructed (via 
deepCopy()). seek() will then be called, but with a new range that 
excludes the data that was already returned. Thus, your 
iterator would be seeked with (2,9] instead of [0,9] again.
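
The [0,9] example above can be simulated in plain Java. This is only a 
sketch of the idea, not how the tablet server is implemented: 
ReseekSketch, readBatch and scanAll are hypothetical names, a TreeSet 
of integers stands in for the table, and a fixed batch size stands in 
for the server-side buffer filling up. After each batch, the next read 
starts at the last returned key, exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

class ReseekSketch {

    // Reads at most batchSize keys from the range that starts at 'start'
    // (inclusive or exclusive, per startInclusive) and ends at 'end' (inclusive).
    static List<Integer> readBatch(NavigableSet<Integer> table, int start,
            boolean startInclusive, int end, int batchSize) {
        List<Integer> batch = new ArrayList<>();
        for (Integer key : table.subSet(start, startInclusive, end, true)) {
            if (batch.size() == batchSize) {
                break; // pretend the buffer filled up
            }
            batch.add(key);
        }
        return batch;
    }

    // Drains [start, end] by "re-seeking" after every batch.
    static List<Integer> scanAll(NavigableSet<Integer> table, int start,
            int end, int batchSize) {
        List<Integer> all = new ArrayList<>();
        int from = start;
        boolean inclusive = true;
        while (true) {
            List<Integer> batch = readBatch(table, from, inclusive, end, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            all.addAll(batch);
            // Re-seek: resume just after the last key already returned.
            from = batch.get(batch.size() - 1);
            inclusive = false;
        }
        return all;
    }
}
```

With keys 0 through 9 and a batch size of 3, the first read covers 
[0,9] and returns 0, 1, 2; the second read covers (2,9]. An iterator 
that keeps state from before the re-seek, or that emits keys outside 
the new range, breaks this scheme.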

I can't say whether or not you're actually hitting this case, but it's a 
common pitfall that affects devs.

madhvi wrote:
> @josh
> If after hasTop and getTopKey, seek would have called then this should
> also be written in call hierarchy.
> Because i have written all the function hierarchy in a file.
> so the problem if i have called myFunction() in seek.
> And after seek getTopKey and getTopValue then hasTop and next should be
> called but what is happening sometime getTopValue is called sometime
> not. This is happening when i am reading entries through batchscanner.
> getTopValue function is called while scanning through scanner, Applying
> same iterator using scanner and batchscanner, through scanner getting
> returned entries but getting no entries returned while using batchscanner.
>
> So can you please explain.

Re: Abnormal behaviour of custom iterator in getting entries

Posted by madhvi <ma...@orkash.com>.
Hi,

I am working with Shweta and have the same question.

@william
I am not doing anything similar to the WholeRowIterator. I am 
calculating the sum (count) for each unique columnQualifier.

@josh
If seek had been called after hasTop and getTopKey, it would also 
appear in the call hierarchy, because I log every function call to a 
file. So is the problem that I call myFunction() in seek?
After seek, I expect getTopKey and getTopValue to be called, followed 
by hasTop and next, but getTopValue is sometimes called and sometimes 
not. This happens when I read entries through a BatchScanner; with a 
Scanner, getTopValue is called. Applying the same iterator with a 
Scanner and a BatchScanner, the Scanner returns entries but the 
BatchScanner returns none.

Can you please explain?

Thanks
Madhvi


On Friday 12 June 2015 09:07 PM, Josh Elser wrote:
> Possible explanation inline
>
> shweta.agrawal wrote:
>> Hi,
>>
>> I am making a custom iterator which returns multiple entries. For some
>> entries getTopValue function is called, sometimes skipped. Due to this
>> behaviour i am not getting all the entries at scan time which are to be
>> returned.
>>
>> I had written functions calling hierarchy in a text file which is:
>> hasTop
>> getTopKey
>
> A seek() might have happened here. That would explain why there was no 
> getTopValue call.
>
>> hasTop
>> getTopKey
>> getTopValue
>
> After the seek, we checked to see that there was a K/V and then polled it
>
>> next
>
> We then tried to advance the "pointer" to the next K/V
>
>> hasTop
>
> And checked to see if we have another pair.
>
>>
>> Thanks
>> Shweta
>>
>>


Re: Abnormal behaviour of custom iterator in getting entries

Posted by Josh Elser <jo...@gmail.com>.
Possible explanation inline

shweta.agrawal wrote:
> Hi,
>
> I am making a custom iterator which returns multiple entries. For some
> entries getTopValue function is called, sometimes skipped. Due to this
> behaviour i am not getting all the entries at scan time which are to be
> returned.
>
> I had written functions calling hierarchy in a text file which is:
> hasTop
> getTopKey

A seek() might have happened here. That would explain why there was no 
getTopValue call.

> hasTop
> getTopKey
> getTopValue

After the seek, we checked to see that there was a K/V and then polled it

> next

We then tried to advance the "pointer" to the next K/V

> hasTop

And checked to see if we have another pair.

>
> Thanks
> Shweta
>
>

Re: Abnormal behaviour of custom iterator in getting entries

Posted by William Slacum <ws...@gmail.com>.
What do you mean by "multiple entries"? Are you doing something similar to
the WholeRowIterator, which encodes all the entries for a given row into a
single key value?

Are you using any other iterators?

In general, calls to `hasTop()`, `getTopKey()` and `getTopValue()` should
not change the state of the iterator, so it should be safe to call them
repeatedly in between calls to `next()` and `seek()`.
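
One way to get that property, sketched in plain Java rather than 
against the Accumulo API: cache the current entry in a field, change 
it only when positioning or advancing, and have the accessors just 
read the field. TopCachingSketch is a hypothetical name, and the 
constructor here stands in for seek().

```java
import java.util.Iterator;
import java.util.Map;

class TopCachingSketch {
    private final Iterator<Map.Entry<String, String>> source;
    // The current entry; null means there is no top.
    private Map.Entry<String, String> top;

    TopCachingSketch(Iterator<Map.Entry<String, String>> source) {
        this.source = source;
        advance(); // position on the first entry, as seek() would
    }

    // The only place where state changes.
    private void advance() {
        top = source.hasNext() ? source.next() : null;
    }

    boolean hasTop() {
        return top != null;
    }

    String getTopKey() {
        if (top == null) {
            throw new IllegalStateException("no top entry");
        }
        return top.getKey();
    }

    String getTopValue() {
        if (top == null) {
            throw new IllegalStateException("no top entry");
        }
        return top.getValue();
    }

    void next() {
        advance();
    }
}
```

Because hasTop(), getTopKey() and getTopValue() only read the cached 
entry, a caller can invoke them any number of times, in any order, 
without losing or skipping entries.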

On Fri, Jun 12, 2015 at 7:47 AM, shweta.agrawal <sh...@orkash.com>
wrote:

> Hi,
>
> I am making a custom iterator which returns multiple entries. For some
> entries getTopValue function is called, sometimes skipped. Due to this
> behaviour i am not getting all the entries at scan time which are to be
> returned.
>
> I had written functions calling hierarchy in a text file which is:
> hasTop
> getTopKey
> hasTop
> getTopKey
> getTopValue
> next
> hasTop
>
> Thanks
> Shweta
>
>
>