Posted to user@kudu.apache.org by Ananth Gundabattula <ag...@gmail.com> on 2017/08/13 02:36:27 UTC

Question on consistent ordering of scanner rows

Hello All,


I was wondering whether the Kudu scanner guarantees that the rows returned from a single tablet scan always arrive in the same order, given the following assumptions:

- The underlying Kudu tablet does not change for the given scan range while the reads are performed multiple times for the same scan token.
- I am using the Java client.
- I am using Kudu version 1.4.0.
- The client code uses the KuduScanTokenBuilder API to plan the set of scans that can be performed for a given query.
- The client calls nextRows() and then iterates over the result with the hasNext() and next() methods of the corresponding iterators.
- During a debug session I noticed a variable called orderMode in the asyncScanner, but this property does not appear to be exposed as a public API yet. Its default value seems to be unordered.


Perhaps the answer is no, per the last point above, but I would like confirmation from the community.

I am integrating Apache Apex with Apache Kudu and am using the scan token builder API to plan the scans in a distributed way. As part of this, I would like to offer Apache Apex end users a configurable way to get consistent scan ordering. Since it is almost impossible to achieve this ordering in a truly distributed fashion for downstream compute nodes, the aim is to provide consistent ordering within a single Apex partition. The Apex-Kudu integration will provide configurations to map one tablet to one or multiple Apex partitions. For either mapping style, I would like to provide further ordering guarantees. However, I am not sure whether Apache Kudu provides a consistent ordering for the same scan when the above assumptions hold.

Could you please advise on the ordering of scanned rows for a single tablet across multiple launches of the same scan token?

Regards,
Ananth

Re: Question on consistent ordering of scanner rows

Posted by da...@gmail.com.
Hi Ananth

  We "hid" the ordered scans API when we added hash partitioning, because an ordered scan would no longer return fully ordered rows across tablet servers and we didn't want to confuse users.
  If all you want is a scan that always returns rows in the same order (but not fully ordered rows), you can achieve that by making the scan fault-tolerant (https://kudu.apache.org/apidocs/org/apache/kudu/client/AbstractKuduScannerBuilder.html#setFaultTolerant-boolean-)
  Note that these kinds of scans might carry a performance penalty.
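
  One way to check this in an integration test is to run the same scan several times and compare the row sequences. The sketch below uses only the Java standard library and is not part of the Kudu client: the class ScanOrderCheck and its methods are hypothetical names, and the scan is abstracted as a Supplier that yields a fresh iterator of row strings on each call. In real use, that supplier would presumably open a scanner from the same scan token each time (with setFaultTolerant(true) set on the builder, assuming the scan token builder accepts it) and convert each row to a string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical test helper; none of these names come from the Kudu client.
final class ScanOrderCheck {

    // Drains one run of the scan into a list, preserving arrival order.
    static List<String> drain(Iterator<String> rows) {
        List<String> out = new ArrayList<>();
        while (rows.hasNext()) {
            out.add(rows.next());
        }
        return out;
    }

    // Runs the scan `runs` times, comparing every later run with the first.
    // Returns the row position of the first difference found in the earliest
    // run that deviates from the first one, or -1 if all runs were identical.
    static int firstMismatch(Supplier<Iterator<String>> scan, int runs) {
        List<String> baseline = drain(scan.get());
        for (int r = 1; r < runs; r++) {
            List<String> current = drain(scan.get());
            int common = Math.min(baseline.size(), current.size());
            for (int i = 0; i < common; i++) {
                if (!baseline.get(i).equals(current.get(i))) {
                    return i;
                }
            }
            if (baseline.size() != current.size()) {
                return common; // one run ended early or returned extra rows
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in for a real scan: an in-memory list with a fixed order.
        List<String> stable = Arrays.asList("k1", "k2", "k3");
        System.out.println(firstMismatch(stable::iterator, 3)); // prints -1
    }
}
```

  A check like this only shows that the order was stable for the runs it observed; it does not prove a guarantee, and it assumes the tablet contents do not change between runs.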

HTH
-David 
