
Posted to dev@accumulo.apache.org by Suresh Prajapati <su...@gmail.com> on 2017/04/27 11:09:06 UTC

Accumulo Table Scanning Taking Time!!!

Hello Team

I am developing a client on Accumulo to store geo-spatial information,
using GeoMesa for indexing on top of it. However, I found that scanning *~1
million* records takes *2-3 sec*. I looked at the GeoMesa indexes and query
plan but could not find the cause of the problem. I am running Accumulo as
a single tablet server (including the master). I want to know:
what factors can affect an Accumulo scan operation, and how can I
reduce this time?

Thank You
Suresh Prajapati

Re: Accumulo Table Scanning Taking Time!!!

Posted by Suresh Prajapati <su...@gmail.com>.

No, I don't see CPU utilisation reaching 100% (it peaks at about 40%). Here are
the Accumulo table data sizes:
Table Name - aj_join
aj_join_attr_v4 <http://localhost:9995/tables?t=1f> - 79.79 MB
aj_join_records_v2 <http://localhost:9995/tables?t=1e> - 58.25 MB

Scan (Entries/s) goes up to 200000.
Disk usage shows ~10Mbps for reads, while the scan rate on the Accumulo web
interface is very low. Here is the screenshot for the same.
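
As a rough cross-check of the figures above, the arithmetic behind the "data rate" question can be sketched as follows. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, it assumes one Accumulo entry per record, and it assumes the whole aj_join_attr_v4 table is read by the scan, neither of which is confirmed in this thread.

```java
import java.util.Locale;

// Back-of-the-envelope scan throughput from the figures quoted in this thread.
// All names here are hypothetical; nothing in this class talks to Accumulo.
final class ScanRateEstimate {

    /** Entries per second for a scan that returns `entries` in `seconds`. */
    static double entriesPerSecond(long entries, double seconds) {
        return entries / seconds;
    }

    /** Seconds needed to return `entries` at a steady rate of `entriesPerSecond`. */
    static double secondsFor(long entries, double entriesPerSecond) {
        return entries / entriesPerSecond;
    }

    /** Megabytes per second if `megabytes` are read in `seconds`. */
    static double megabytesPerSecond(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    public static void main(String[] args) {
        long entries = 1_000_000L;      // "~1 million records" (assumed: one entry per record)
        double monitorRate = 200_000.0; // Scan (Entries/s) reported on the monitor
        double attrTableMb = 79.79;     // size of aj_join_attr_v4 (assumed: fully scanned)

        // Observed wall-clock range from the original question: 2-3 seconds.
        for (double seconds : new double[] {2.0, 3.0}) {
            System.out.printf(Locale.ROOT,
                "%.0f s -> %.0f entries/s, %.2f MB/s%n",
                seconds,
                entriesPerSecond(entries, seconds),
                megabytesPerSecond(attrTableMb, seconds));
        }

        // Time the same scan would take at the rate shown on the monitor.
        System.out.printf(Locale.ROOT,
            "At %.0f entries/s the scan would take %.1f s%n",
            monitorRate, secondsFor(entries, monitorRate));
    }
}
```

Comparing the implied entries/s with the rate on the monitor is one way to tell whether the 2-3 seconds is dominated by the tablet server scan or by something on the client side.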

On Mon, May 1, 2017 at 8:22 PM, Keith Turner <ke...@deenlo.com> wrote:

> Do you know if the tablet server and/or client is CPU bound?  When you
> run the query, do you see either go to 100% CPU?
>
> For the *~1 million* records, what is the data size?  I ask because I
> am curious what the data rate is.  For example, is it 2MB/sec or
> 500KB/sec?

Re: Accumulo Table Scanning Taking Time!!!

Posted by Keith Turner <ke...@deenlo.com>.

Do you know if the tablet server and/or client is CPU bound?  When you
run the query, do you see either go to 100% CPU?

For the *~1 million* records, what is the data size?  I ask because I
am curious what the data rate is.  For example, is it 2MB/sec or
500KB/sec?

Re: Accumulo Table Scanning Taking Time!!!

Posted by Suresh Prajapati <su...@gmail.com>.

Hello Marc

Thanks for pointing out the problem areas. I tried changing
*table.scan.max.memory* but didn't see any change in performance.
I am trying to fetch the count of records matching a given query by using
the AccumuloDataStore (ds) stats. Here is my sample code:

// Count the features matching a ride ID via the GeoMesa stats API.
public int getRideCount(Long rideId) throws Exception {
    if (rideId != null) {
        return ((Long) (ds.stats().getCount(sft, CQL.toFilter("r=" + rideId), true).get())).intValue();
    }
    return 0;
}

I also tried using an iterator, but this is even worse. Below is the sample
code:

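public int getRideCount(Long rideId) throws Exception {
    int count = 0;
    if (rideId != null) {
        Query q = new Query(tableName, CQL.toFilter("r=" + rideId));
        SimpleFeatureIterator it = sfs.getFeatures(q).features();
        try {
            // Pull every matching feature to the client just to count it.
            while (it.hasNext()) {
                it.next();
                count++;
            }
        } finally {
            // Close the iterator even if the scan fails part-way through.
            it.close();
        }
    }
    return count;
}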


To show the *key structure*, here is my feature type description:


r:Long:cardinality=high:index=join,*g:Point:srid=4326,di:Integer:index=join,al:Float,s:Float,b:Float,an:Float,he:Float,ve:Float,t:Float,m:Boolean,i:Boolean,ts:Long;geomesa.table.sharing='true',geomesa.indices='attr:4:3,records:2:3,z2:3:3',geomesa.table.sharing.prefix='\\u0001'


Please feel free to ask for any further clarifications.

Thank You

Suresh Prajapati

On Thu, Apr 27, 2017 at 7:05 PM, Marc P. <ma...@gmail.com> wrote:

> Suresh,
>    There are a lot of configuration points that can have an impact. For
> example, there is a configuration option that dictates how much data is
> returned in each "iteration," called table.scan.max.memory [0]. Increasing
> this will cause more work to be done in each RPC call to get data. Lowering
> this can give the illusion of improved response time since you get data
> sooner. Playing with this might impact your use case. If your keys/values
> are large you might try increasing this value.
>
> Further, scanning can be impacted by the size of the data and the way it is
> stored. Table block caching might help [1], but I'm curious
> about how the data is stored. Do you have example keys? Are you returning
> all 1 million records from Accumulo through the scanner to perform some
> logic client side, or is the logic server side in an iterator? Could you do
> more work in an iterator? Iterating over 1 M keys likely won't take 2-3
> seconds when executed at the tablet server, depending on the size of the
> key. Knowing the key structure might give us
> more insight into how to better configure your tablet server properties.
>
>    Finally, is the 2-3 seconds just the time to get the data or does that
> include time to inspect keys?
>
> [0]
> http://accumulo.apache.org/1.6/accumulo_user_manual#_table_scan_max_memory
> [1] http://accumulo.apache.org/1.6/accumulo_user_manual#_block_cache

Re: Accumulo Table Scanning Taking Time!!!

Posted by "Marc P." <ma...@gmail.com>.

Suresh,
   There are a lot of configuration points that can have an impact. For
example, there is a configuration option that dictates how much data is
returned in each "iteration," called table.scan.max.memory [0]. Increasing
this will cause more work to be done in each RPC call to get data. Lowering
this can give the illusion of improved response time since you get data
sooner. Playing with this might impact your use case. If your keys/values
are large you might try increasing this value.
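
To make the batching trade-off above concrete, here is a rough sketch of how the number of batches changes with the batch limit. It is an estimate only: it assumes the number of batches is simply the scanned data size divided by the limit, uses the 79.79 MB table size reported elsewhere in this thread as a stand-in for the amount of data scanned, and takes 512K as the default for table.scan.max.memory (an assumption that should be checked against the manual for the version in use). The class and method names are hypothetical.

```java
import java.util.Locale;

// Rough estimate of how many batches a scan returns for a given batch limit.
// This is plain arithmetic and does not call any Accumulo API.
final class ScanBatchEstimate {

    /** Ceiling of totalBytes / batchBytes, i.e. the approximate number of batches. */
    static long batches(long totalBytes, long batchBytes) {
        if (batchBytes <= 0) {
            throw new IllegalArgumentException("batchBytes must be positive");
        }
        return (totalBytes + batchBytes - 1) / batchBytes;
    }

    public static void main(String[] args) {
        // ASSUMPTION: the whole 79.79 MB table is scanned and sizes are uncompressed.
        long totalBytes = (long) (79.79 * 1024 * 1024);

        // Candidate limits: 512K (assumed default), 1M and 4M.
        long[] limits = {512L * 1024, 1024L * 1024, 4L * 1024 * 1024};
        for (long limit : limits) {
            System.out.printf(Locale.ROOT,
                "limit %d KB -> about %d batches%n",
                limit / 1024, batches(totalBytes, limit));
        }
    }
}
```

Fewer, larger batches mean fewer round trips but a longer wait for the first result; more, smaller batches give the opposite.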

Further, scanning can be impacted by the size of the data and the way it is
stored. Table block caching might help [1], but I'm curious
about how the data is stored. Do you have example keys? Are you returning
all 1 million records from Accumulo through the scanner to perform some
logic client side, or is the logic server side in an iterator? Could you do
more work in an iterator? Iterating over 1 M keys likely won't take 2-3
seconds when executed at the tablet server, depending on the size of the
key. Knowing the key structure might give us
more insight into how to better configure your tablet server properties.

   Finally, is the 2-3 seconds just the time to get the data or does that
include time to inspect keys?

[0]
http://accumulo.apache.org/1.6/accumulo_user_manual#_table_scan_max_memory
[1] http://accumulo.apache.org/1.6/accumulo_user_manual#_block_cache
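
One way to answer the last question (time to get the data versus time spent inspecting it) is to measure the time to the first result separately from the total time. The helper below is a minimal sketch using only the JDK; the class name is hypothetical, and a GeoTools SimpleFeatureIterator or an Accumulo scanner would need a thin adapter to java.util.Iterator before being passed in.

```java
import java.util.Arrays;
import java.util.Iterator;

// Drains an iterator and records the time to the first element and the total time.
final class ScanTiming {

    /** Element count plus two timings in nanoseconds. */
    static final class Result {
        final long count;
        final long nanosToFirst; // -1 if the iterator was empty
        final long nanosTotal;

        Result(long count, long nanosToFirst, long nanosTotal) {
            this.count = count;
            this.nanosToFirst = nanosToFirst;
            this.nanosTotal = nanosTotal;
        }
    }

    /** Consumes every element without inspecting it, timing the first and last. */
    static <T> Result drain(Iterator<T> it) {
        long start = System.nanoTime();
        long first = -1L;
        long count = 0L;
        while (it.hasNext()) {
            it.next();
            if (count == 0) {
                first = System.nanoTime() - start;
            }
            count++;
        }
        return new Result(count, first, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in data; a real measurement would wrap the scan results instead.
        Result r = drain(Arrays.asList("a", "b", "c").iterator());
        System.out.println("count=" + r.count
            + " firstNanos=" + r.nanosToFirst
            + " totalNanos=" + r.nanosTotal);
    }
}
```

If the total time stays at 2-3 seconds even when the loop body does nothing, the cost is in fetching the data rather than in client-side processing.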

Re: Fwd: Accumulo Table Scanning Taking Time!!!

Posted by Dave Marion <dl...@comcast.net>.

You could add more tablet servers and add splits to the table.

> On April 27, 2017 at 7:17 AM Suresh Prajapati <su...@gmail.com> wrote:
>
>
> ---------- Forwarded message ----------
> From: Suresh Prajapati <su...@gmail.com>
> Date: Thu, Apr 27, 2017 at 4:39 PM
> Subject: Accumulo Table Scanning Taking Time!!!
> To: dev@accumulo.apache.org

Fwd: Accumulo Table Scanning Taking Time!!!

Posted by Suresh Prajapati <su...@gmail.com>.

---------- Forwarded message ----------
From: Suresh Prajapati <su...@gmail.com>
Date: Thu, Apr 27, 2017 at 4:39 PM
Subject: Accumulo Table Scanning Taking Time!!!
To: dev@accumulo.apache.org

Hello Team

I am developing a client on Accumulo to store geo-spatial information,
using GeoMesa for indexing on top of it. However, I found that scanning *~1
million* records takes *2-3 sec*. I looked at the GeoMesa indexes and query
plan but could not find the cause of the problem. I am running Accumulo as
a single tablet server (including the master). I want to know:
what factors can affect an Accumulo scan operation, and how can I
reduce this time?

Thank You
Suresh Prajapati