Posted to dev@rya.apache.org by Boris Pelakh <bo...@semanticarts.com> on 2019/05/08 16:01:18 UTC

Named graph performance issue

Hi,

One of our customers has an application that has highly localized graph patterns, so they partition their data into named graphs and primarily perform graph-level read and replace operations. They are trying to make a transition from RDF4J and Neptune to Rya and have seen some performance issues. I stood up a local Rya instance in my Ubuntu VM and got some measurements:


  1.  Loaded 11K datasets averaging about 120 triples each (total 1.4 million triples)
  2.  Post-insertion named graph fetch - 3.9 seconds. (RDF4J time was less than a second)
  3.  Compacted all the tables, average fetch of a graph - 1.9 seconds
  4.  Rya stores the graph name in the column family, so a full fetch of a named graph is a range-less scan with a specified column family. I removed Rya from the equation and wrote a small test program that did an equivalent column family scan. Average time - 1.9 seconds, so it appears Rya overhead is negligible. I tried variations: a single range scanner, then a batch scanner with a single range specified, then just the column family - same results
  5.  Furthermore, the query did not speed up with repetition, i.e. no index warming effect
  6.  Modified my graph fetch query from
construct { ?s ?p ?o } where { graph <http://my/graph> { ?s ?p ?o }}
to
construct { ?s ?p ?o } where { graph <http://my/graph> { ?s a ?type; ?p ?o }}
(which produced the exact same RDF output)
This would execute as a range scan on the po table (using the rdf:type predicate prefix), followed by a guided batch scan on the spo table on the found subjects.
Total execution time = 0.85 seconds. After repetition = 0.46 seconds as the indices warmed

So, what I see is that Accumulo is much better at a range scan than at a column family scan, so much so that even running 2 scans and a join is still faster. It seems that if we wanted to get decent performance on graph fetches, we would have to generate a `gspo` table or something similar.
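
To illustrate why a graph-first layout would turn the fetch into a range scan, here is a minimal in-memory sketch. It is not Rya's actual schema: the `GspoKeySketch` class, the `\u0000` field separator, and the use of a `TreeMap` as a stand-in for a sorted table are all assumptions made for illustration. The point is only that when the graph name leads the row key, all triples of one graph sit in a single contiguous key range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory stand-in for a sorted table with a hypothetical graph-first (gspo) row key. */
class GspoKeySketch {
    // Assumed field separator; it must never occur inside a graph name, IRI or literal.
    static final char SEP = '\u0000';

    // A sorted map plays the role of the table: row key -> value (unused here).
    private final NavigableMap<String, String> table = new TreeMap<>();

    /** Builds the row key graph|subject|predicate|object. */
    static String rowKey(String g, String s, String p, String o) {
        return g + SEP + s + SEP + p + SEP + o;
    }

    void insert(String g, String s, String p, String o) {
        table.put(rowKey(g, s, p, o), "");
    }

    /** Returns every row key of one graph by reading a single contiguous key range. */
    List<String> scanGraph(String g) {
        String start = g + SEP;              // smallest possible key of this graph
        String end = g + (char) (SEP + 1);   // first key past this graph's prefix
        return new ArrayList<>(table.subMap(start, true, end, false).keySet());
    }

    public static void main(String[] args) {
        GspoKeySketch t = new GspoKeySketch();
        t.insert("http://my/graph", "s1", "p", "o1");
        t.insert("http://my/graph", "s2", "p", "o2");
        t.insert("http://my/graph2", "s3", "p", "o3"); // shares a prefix, but is a different graph
        System.out.println("Rows in http://my/graph: " + t.scanGraph("http://my/graph").size());
    }
}
```

The trade-off is an extra index table to write on every insert, in exchange for a graph fetch that never has to visit rows belonging to other graphs.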

Any ideas of another approach to improve the performance of this type of query?

PS. Here is my test code,
import org.apache.accumulo.core.client.*;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

import java.util.Collections;
import java.util.Map;

public class ScanPerfTest {

    public static void main(String[] args) {
        String instanceName = "accumulo";
        String zooServers = "localhost";
        Instance inst = new ZooKeeperInstance(instanceName, zooServers);

        try {
            Connector con = inst.getConnector("rya", new PasswordToken("rya"));
            Scanner s = con.createScanner("sa_ts_spo", new Authorizations());
            try {
//                s.setRange(new Range(
//                        new Key(new Text(new byte[]{})),
//                        new Key(new Text(new byte[]{(byte) 0xff}))));
                s.fetchColumnFamily(new Text("http://my/graph"));
                long start = System.currentTimeMillis();
                int triples = 0;
                for (Map.Entry<Key, Value> e : s) {
                    // System.out.println(e.getKey().getRow().toString());
                    triples++;
                }
                System.out.println("Read " + triples + " triples in " + (System.currentTimeMillis() - start) + "ms");
            } finally {
                s.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


Boris Pelakh
Ontologist, Developer, Software Architect
boris.pelakh@semanticarts.com
+1-321-243-3804


RE: Named graph performance issue

Posted by Boris Pelakh <bo...@semanticarts.com>.
Thanks for the idea, I will give it a shot. It's not really the intended use of locality groups, since instead of a statically partitioned list, the way you would have with traditional column families, the named graph list is dynamic and large. I am going to try to partition based on a short hash of the graph name, see if I can get a decent distribution, then measure how it affects performance.
I guess if that works, then newly created graphs (which would initially slot into the default locality group) can be reassigned periodically as a batch job and cleaned up during compaction.
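
A rough sketch of the hash-based partitioning idea follows, under stated assumptions: the `GraphLocalityGroups` class, the `lg` group-name prefix, the choice of CRC32 as the short hash, and the bucket count are all hypothetical and were not part of the experiment described here. The resulting map has the shape group name -> set of column families, which is the shape Accumulo's `TableOperations.setLocalityGroups` takes (with `Text` rather than `String` for the family names).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.CRC32;

/** Assigns named graphs (column families) to a fixed number of locality groups by hash. */
class GraphLocalityGroups {

    /** Maps a graph name to a group name such as "lg3"; the same name always lands in the same group. */
    static String groupFor(String graphName, int buckets) {
        CRC32 crc = new CRC32();
        crc.update(graphName.getBytes(StandardCharsets.UTF_8));
        long bucket = crc.getValue() % buckets; // getValue() is never negative
        return "lg" + bucket;
    }

    /** Builds group name -> graph names for all given graphs. */
    static Map<String, Set<String>> assign(Iterable<String> graphNames, int buckets) {
        Map<String, Set<String>> groups = new HashMap<>();
        for (String g : graphNames) {
            groups.computeIfAbsent(groupFor(g, buckets), k -> new TreeSet<>()).add(g);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 11000; i++) {
            names.add("http://my/graph/" + i); // made-up graph names
        }
        // Print the group sizes to eyeball how even the distribution is.
        assign(names, 16).forEach((group, members) ->
                System.out.println(group + ": " + members.size() + " graphs"));
    }
}
```

Graphs created after the groups were last set would not appear in any group until the assignment is recomputed, which matches the periodic batch reassignment mentioned above.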

What was your motivation for switching to a Mongo-based storage schema? Does it have advantages over Accumulo with regard to scalability or features? Is there documentation of the storage schema in the code base?

Boris Pelakh
Ontologist, Developer, Software Architect
boris.pelakh@semanticarts.com
+1-321-243-3804


-----Original Message-----
From: Puja Valiyil <pu...@gmail.com> 
Sent: Wednesday, May 8, 2019 12:32 PM
To: dev@rya.incubator.apache.org
Subject: Re: Named graph performance issue

Hi Boris,
Did you try configuring Accumulo to use locality groups?  I think that groups column family values in the same files, which may help in your case.  Sorry if I'm completely off base here; I've been in MongoDB land for so long I may have lost touch with how the Accumulo version of Rya works.
Thanks,
Puja

Re: Named graph performance issue

Posted by Puja Valiyil <pu...@gmail.com>.
Hi Boris,
Did you try configuring Accumulo to use locality groups?  I think that groups column family values in the same files, which may help in your case.  Sorry if I'm completely off base here; I've been in MongoDB land for so long I may have lost touch with how the Accumulo version of Rya works.
Thanks,
Puja