Posted to user@accumulo.apache.org by Michael Orr <mi...@gmail.com> on 2013/11/06 15:19:09 UTC

Optimizing Accumulo for read performance

Hello,

I’m working on an application that needs fast read performance. I’ve been
conducting some experiments starting with a single (pseudo-distributed)
cluster with the intent of scaling out. However, prior to doing so, I
wanted to get a good gauge for how fast a single tablet server can read.

The application processes and stores graph data with the following schema:

for nodes:
N|NodeID    ID:NodeID    EIN:EdgeID    EOUT:EdgeID    .. lots of other attributes

there can be multiple EIN and EOUT CFs for each node

for edges:
E|EdgeID    ID:NodeID    VIN:VertexID    EOUT:VertexID    .. lots of other attributes
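To make the row-key layout above concrete, here is a small illustrative sketch (plain Python, not Accumulo client code; the zero-padded IDs are my own assumption) showing how such keys sort lexicographically, so all "E|" rows and all "N|" rows land in contiguous ranges that a scanner can cover with a single row range each:

```python
# Hypothetical node and edge IDs, zero-padded so they sort numerically
# within each prefix.
rows = [f"N|{i:08d}" for i in (3, 1, 2)] + [f"E|{i:08d}" for i in (2, 1)]

# Accumulo keeps keys sorted in byte order; sorted() models that here.
sorted_rows = sorted(rows)

# All edge rows ("E|...") precede all node rows ("N|...") because
# 'E' < 'N' in byte order, so each prefix is one contiguous range.
print(sorted_rows)
```

The padding matters: without it, "N|10" would sort before "N|2" and range scans over numeric ID intervals would not behave as expected.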


Scans into the system can be for the entire graph or a subset of nodes and
edges. We generally pull navigational information first, then other
attributes later if needed. I’ve spent some time looking into using
locality groups, but I was curious whether there are recommendations for
backend properties that could be set to improve read performance,
particularly if memory and space are not a concern.

Thanks for your help!

Mike

Re: Optimizing Accumulo for read performance

Posted by Michael Orr <mi...@gmail.com>.
Thanks for responding.


The RKEYs for nodes are N|<NodeID> and we have CF:CQs for each edge. We
maintain the edge attributes as separate RKEYs using E|<EdgeID>.


I’m not sure what you mean by repeating the node ID.


Mike


On Wed, Nov 6, 2013 at 9:58 AM, William Slacum <wilhelm.von.cloud@accumulo.net> wrote:

> When you say schema, do you mean key schema? If so, why are you repeating
> the node id?
>
> Locality groups would help if you have larger swaths of data you wanted to
> group together and query discretely from other locality groups. For
> instance, I've seen key schemas where "in" and "out" edges are grouped
> together.
>
> At a system level, if you know some information about the distribution of
> the row values (in this case, it looks like node id and edge id), you can
> pre-split the table by taking some samples out of that space. This would
> distribute the tablets around, making queries using the batch scanner
> faster by increasing the parallelism. This would also increase the number
> of input splits generated by the input format if you wanted to do batch
> processing on the entire graph.
>
> On Wed, Nov 6, 2013 at 9:19 AM, Michael Orr <mi...@gmail.com> wrote:
>
>> Hello,
>>
>> I’m working on an application that needs fast read performance. I’ve been
>> conducting some experiments starting with a single (pseudo-distributed)
>> cluster with the intent of scaling out. However, prior to doing so, I
>> wanted to get a good gauge for how fast a single tablet server can read.
>>
>> The application processes and stores graph data with the following schema:
>>
>> for nodes:
>> N|NodeID    ID:NodeID    EIN:EdgeID    EOUT:EdgeID    .. lots of other attributes
>>
>> there can be multiple EIN and EOUT CFs for each node
>>
>> for edges:
>> E|EdgeID    ID:NodeID    VIN:VertexID    EOUT:VertexID    .. lots of other attributes
>>
>>
>> Scans into the system can be for the entire graph or a subset of nodes and
>> edges. We generally pull navigational information first, then other
>> attributes later if needed. I’ve spent some time looking into using
>> locality groups, but I was curious whether there are recommendations for
>> backend properties that could be set to improve read performance,
>> particularly if memory and space are not a concern.
>>
>> Thanks for your help!
>>
>> Mike
>>
>
>

Re: Optimizing Accumulo for read performance

Posted by William Slacum <wi...@accumulo.net>.
When you say schema, do you mean key schema? If so, why are you repeating
the node id?

Locality groups would help if you have larger swaths of data you wanted to
group together and query discretely from other locality groups. For
instance, I've seen key schemas where "in" and "out" edges are grouped
together.
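For reference, a grouping along those lines can be configured from the Accumulo shell. This is a sketch assuming a table named "graph" and a group name "nav" (both hypothetical); data written before the change is regrouped the next time its files compact:

```
setgroups nav=ID,EIN,EOUT -t graph
getgroups -t graph
compact -t graph -w
```

The -w flag makes the compact command wait for the compaction to finish, so a scan issued afterward sees the new file layout.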

At a system level, if you know some information about the distribution of
the row values (in this case, it looks like node id and edge id), you can
pre-split the table by taking some samples out of that space. This would
distribute the tablets around, making queries using the batch scanner
faster by increasing the parallelism. This would also increase the number
of input splits generated by the input format if you wanted to do batch
processing on the entire graph.
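The sampling idea above can be sketched as follows. This is plain Python rather than the Accumulo API, and the ID format and tablet count are illustrative assumptions; the resulting split points would be handed to the shell's addsplits command (or TableOperations.addSplits in the Java client):

```python
import random

def split_points(sample_rows, num_tablets):
    """Pick evenly spaced split points from a sample of row keys,
    giving roughly equal-sized tablets under the sampled distribution."""
    s = sorted(sample_rows)
    step = len(s) // num_tablets
    # One split point between each pair of adjacent tablets.
    return [s[i * step] for i in range(1, num_tablets)]

# Hypothetical sample of node rows drawn uniformly from the ID space.
random.seed(42)
sample = [f"N|{random.randrange(10**8):08d}" for _ in range(1000)]

splits = split_points(sample, num_tablets=8)  # 7 splits -> 8 tablets
print(splits)
```

With 8 tablets spread across tablet servers, a batch scanner with multiple query threads can fetch disjoint ranges in parallel instead of queuing on a single server.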

On Wed, Nov 6, 2013 at 9:19 AM, Michael Orr <mi...@gmail.com> wrote:

> Hello,
>
> I’m working on an application that needs fast read performance. I’ve been
> conducting some experiments starting with a single (pseudo-distributed)
> cluster with the intent of scaling out. However, prior to doing so, I
> wanted to get a good gauge for how fast a single tablet server can read.
>
> The application processes and stores graph data with the following schema:
>
> for nodes:
> N|NodeID    ID:NodeID    EIN:EdgeID    EOUT:EdgeID    .. lots of other attributes
>
> there can be multiple EIN and EOUT CFs for each node
>
> for edges:
> E|EdgeID    ID:NodeID    VIN:VertexID    EOUT:VertexID    .. lots of other attributes
>
>
> Scans into the system can be for the entire graph or a subset of nodes and
> edges. We generally pull navigational information first, then other
> attributes later if needed. I’ve spent some time looking into using
> locality groups, but I was curious whether there are recommendations for
> backend properties that could be set to improve read performance,
> particularly if memory and space are not a concern.
>
> Thanks for your help!
>
> Mike
>