You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jena.apache.org by afs <gi...@git.apache.org> on 2017/11/12 10:37:37 UTC

[GitHub] jena pull request #306: Algorithms for JENA-1414

GitHub user afs opened a pull request:

    https://github.com/apache/jena/pull/306

    Algorithms for JENA-1414

    This PR rstructures `GraphUtil.deleteFrom` to put the decision on whether to loop on the target (_dst_) graph or the source (_src_) graph.
    
    There are 3 algorithms for discussion:
    
    1. Use the size of the graphs - this is the current Jena 3.5.0 policy
    2. Use the size of the _src_ and iterate on the _dst_ to compare sizes
    3. Iterator on both _src_ and _dst_ to compare sizes. `Graph.size` is not used at all.
    
    The cost of `Graph.size()` can be small (already known) or large (needs to be calculated to be accurate). The latter is bad for large persistent graphs.
    
    The PR also adds javadoc to explain how to call a specific algorithm (_src_ loop or _dst_ loop).
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afs/jena graph-deleteFrom

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/306.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #306
    
----
commit 7faca15490448af1679f2705ad820f0baa7f1efe
Author: Andy Seaborne <an...@apache.org>
Date:   2017-11-11T20:10:51Z

    Algorithms for JENA-1414

----


---

[GitHub] jena pull request #306: Algorithms for JENA-1414

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/306#discussion_r151692738
  
    --- Diff: jena-base/src/main/java/org/apache/jena/atlas/iterator/Iter.java ---
    @@ -351,6 +351,22 @@ public void remove() {
             return filter(iter, Objects::nonNull) ;
         }
     
    +    /** Step forward up to {@code steps} places.
    +     * <br/>Return number of steps taken.
    --- End diff --
    
    `@return number of steps taken`


---

[GitHub] jena pull request #306: Algorithms for JENA-1414

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on a diff in the pull request:

    https://github.com/apache/jena/pull/306#discussion_r152085996
  
    --- Diff: jena-core/src/main/java/org/apache/jena/graph/GraphUtil.java ---
    @@ -246,43 +282,214 @@ private static void deleteIteratorWorkerDirect(Graph graph, Iterator<Triple> it)
             }
         }
     
    -    private static final int sliceSize = 1000 ;
    -    /** A safe and cautious remove() function that converts the remove to
    -     *  a number of {@link Graph#delete(Triple)} operations. 
    +    private static int MIN_SRC_SIZE   = 1000 ;
    +    // If source and destination are large, limit the search for the best way round to "deleteFrom" 
    +    private static int MAX_SRC_SIZE   = 1000*1000 ;
    +    private static int DST_SRC_RATIO  = 2 ;
    +
    +    /**
    +     * Delete triples in the destination (arg 1) as given in the source (arg 2).
    +     *
    +     * @implNote
    +     *  This is designed for the case of {@code dstGraph} being comparable or much larger than
    +     *  {@code srcGraph} or {@code srcGraph} having a lot of triples to actually be
    +     *  deleted from {@code dstGraph}. This includes large, persistent {@code dstGraph}.
    +     *  <p>  
    +     *  It is not designed for a large {@code srcGraph} and large {@code dstGraph} 
    +     *  with only a few triples in common delete from {@code dstGraph}. It is better to
    +     *  calculate the difference in someway, and copy into a small graph to use as the {@srcGraph}.  
    --- End diff --
    
    typo: some way


---

[GitHub] jena pull request #306: Algorithms for JENA-1414

Posted by afs <gi...@git.apache.org>.
Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/306#discussion_r151390296
  
    --- Diff: jena-core/src/main/java/org/apache/jena/graph/GraphUtil.java ---
    @@ -223,6 +258,116 @@ public static void deleteFrom(Graph dstGraph, Graph srcGraph) {
             dstGraph.getEventManager().notifyDeleteGraph(dstGraph, srcGraph) ;
         }
     
    +    private static final int CMP_GREATER = 1;
    +    private static final int CMP_EQUAL   = 0;
    +    private static final int CMP_LESS    = -1;
    +    private static final int CMP_UNKNOWN = -2;
    +
    +    // Decide whether to loop on dstGraph or srcGraph.
    +    private static boolean decideHowtoExecute(Graph dstGraph, Graph srcGraph) {
    +        if ( false ) {
    +            // Use dstGraph.size() and srcGraph.size()
    +            // Assumes dstGraph.size() (and to some extent srcGraph.size() are efficient.
    +            int dstSize = dstGraph.size();
    +            int srcSize = srcGraph.size();
    +
    +            // Loop on src if:
    +            // * srcGraph is below the threshold MIN_SRC_SIZE (a "Just Do it" number)
    +            // * dstGraph is "much" larger than src where "much" is given by DST_SRC_RATIO
    +            //     Assumes dstGraph is efficient.
    +            boolean loopOnSrc = (srcSize < MIN_SRC_SIZE || dstSize > DST_SRC_RATIO*srcSize) ;
    +            return loopOnSrc;
    +        }
    +        
    +        // Avoid dstGraph.size()
    +        // do |find()| > DST_SRC_RATIO*srcSize
    +        if ( false ) {
    +            int srcSize = srcGraph.size();
    +            boolean loopOnSrc = (srcSize < MIN_SRC_SIZE || compareSizeTo(dstGraph, DST_SRC_RATIO*srcSize) == CMP_GREATER) ;
    +            return loopOnSrc;
    +        }
    +        
    +        // Avoid dstGraph.size() and srcGraph.size()
    +        // do |find1()| > DST_SRC_RATIO*|find2()|
    +        // limit on |find2()|
    +        int cmp = compareSizeByFind(dstGraph, srcGraph, DST_SRC_RATIO, 1000*MIN_SRC_SIZE);
    +        boolean loopOnSrc = (cmp==CMP_GREATER || cmp==CMP_UNKNOWN) ; 
    +        return loopOnSrc;
    +    }
    +    
    +    private static int compareSizeTo(Graph g, int size) {
    +        int counter = 0;
    +        ExtendedIterator<Triple> it = g.find();
    +        try {
    +            while (it.hasNext()) {
    +                it.next();
    +                counter += 1;
    +                if (counter > size) {
    +                    return CMP_GREATER ;
    +                }
    +            }
    +            return counter == size ? CMP_EQUAL : CMP_LESS;
    +        } finally {
    +            it.close();
    +        }
    +    }
    +
    +    /** Compare the ration of graph sizes by co-iteration, up to some limit on the srcGraph size. */ 
    +    private static int compareSizeByFind(Graph dstGraph, Graph srcGraph, int ratio, int maxCompare) {
    +        ExtendedIterator<Triple> dstIter = dstGraph.find();
    +        ExtendedIterator<Triple> srcIter = srcGraph.find();
    +        try {
    +            // Step dstIter by "ratio" steps each time.
    +            return compareByLength(dstIter, srcIter, ratio, 1, maxCompare);
    +        } finally {
    +            dstIter.close();
    +            srcIter.close();
    +        }
    +    }
    +    
    +    /** Compare the length of two iterators.
    +     *  <b>This operation is destructive on the iterators<b>.
    +     */
    +    private static int compareByLength(Iterator<?> iter1, Iterator<?> iter2) {
    +        return compareByLength(iter1, iter2, 1, 1, Integer.MAX_VALUE);
    +    }
    +
    +    /** Compare the length of two iterators. By using the "step" argument, the app can compare rations.
    +     *  <b>This operation is destructive on the iterators<b>.
    +     */
    +    private static int compareByLength(Iterator<?> iter1, Iterator<?> iter2, int step1, int step2, int maxSteps) {
    +        if ( step1 <= 0 )
    +            throw new IllegalArgumentException("step size 'step1' is not positve: "+step1);
    +        if ( step2 <= 0 )
    +            throw new IllegalArgumentException("step size 'step2' is not positve: "+step2);
    +        
    +        int x = 0; 
    +        for ( ;; ) {
    +            x++ ;
    +            boolean ended1 = iter1.hasNext();
    +            boolean ended2 = iter2.hasNext();
    +            
    +            if ( ended1 )
    --- End diff --
    
    Good catch. It should be `boolean ended1 = ! iter1.hasNext();`
    
    This work needs to test the worker functions themselves because the wrong execution policy decision isn't going to show up in the test suite. That seems better than adding large scale testing which isn't black/white on this sort of thing.



---

[GitHub] jena issue #306: Algorithms for JENA-1414

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/306
  
    Force-pushed the reduced code - algorithm 2 is the active one.  Algorithm 1 is there but not active.


---

[GitHub] jena pull request #306: Algorithms for JENA-1414

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/jena/pull/306


---

[GitHub] jena issue #306: Algorithms for JENA-1414

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/306
  
    Revised version.
    
    The co-iteration algorithm (choice 3) is complicated. Determining the right thing to do each time is expensive and still a pragmatic choice with graph probing. The app may well know what to do so call the appropriate method as already documented.
    
    My suggestion is algorithm 2 (a slightly improved version of the patch on  JENA-1414), document what to do if src is a large persistent database and remove the complicate algorithm 3.
    
    If that is our decision, I will produce a much simpler commit for algorithm 2.



---

[GitHub] jena issue #306: Algorithms for JENA-1414

Posted by ajs6f <gi...@git.apache.org>.
Github user ajs6f commented on the issue:

    https://github.com/apache/jena/pull/306
  
    Okay, now I get it. Agreed that number 3 is "trying too hard" and on the proposal to provide number 2 and document appropriate usage.


---

[GitHub] jena pull request #306: Algorithms for JENA-1414

Posted by rvesse <gi...@git.apache.org>.
Github user rvesse commented on a diff in the pull request:

    https://github.com/apache/jena/pull/306#discussion_r151387036
  
    --- Diff: jena-core/src/main/java/org/apache/jena/graph/GraphUtil.java ---
    @@ -223,6 +258,116 @@ public static void deleteFrom(Graph dstGraph, Graph srcGraph) {
             dstGraph.getEventManager().notifyDeleteGraph(dstGraph, srcGraph) ;
         }
     
    +    private static final int CMP_GREATER = 1;
    +    private static final int CMP_EQUAL   = 0;
    +    private static final int CMP_LESS    = -1;
    +    private static final int CMP_UNKNOWN = -2;
    +
    +    // Decide whether to loop on dstGraph or srcGraph.
    +    private static boolean decideHowtoExecute(Graph dstGraph, Graph srcGraph) {
    +        if ( false ) {
    +            // Use dstGraph.size() and srcGraph.size()
    +            // Assumes dstGraph.size() (and to some extent srcGraph.size() are efficient.
    +            int dstSize = dstGraph.size();
    +            int srcSize = srcGraph.size();
    +
    +            // Loop on src if:
    +            // * srcGraph is below the threshold MIN_SRC_SIZE (a "Just Do it" number)
    +            // * dstGraph is "much" larger than src where "much" is given by DST_SRC_RATIO
    +            //     Assumes dstGraph is efficient.
    +            boolean loopOnSrc = (srcSize < MIN_SRC_SIZE || dstSize > DST_SRC_RATIO*srcSize) ;
    +            return loopOnSrc;
    +        }
    +        
    +        // Avoid dstGraph.size()
    +        // do |find()| > DST_SRC_RATIO*srcSize
    +        if ( false ) {
    +            int srcSize = srcGraph.size();
    +            boolean loopOnSrc = (srcSize < MIN_SRC_SIZE || compareSizeTo(dstGraph, DST_SRC_RATIO*srcSize) == CMP_GREATER) ;
    +            return loopOnSrc;
    +        }
    +        
    +        // Avoid dstGraph.size() and srcGraph.size()
    +        // do |find1()| > DST_SRC_RATIO*|find2()|
    +        // limit on |find2()|
    +        int cmp = compareSizeByFind(dstGraph, srcGraph, DST_SRC_RATIO, 1000*MIN_SRC_SIZE);
    +        boolean loopOnSrc = (cmp==CMP_GREATER || cmp==CMP_UNKNOWN) ; 
    +        return loopOnSrc;
    +    }
    +    
    +    private static int compareSizeTo(Graph g, int size) {
    +        int counter = 0;
    +        ExtendedIterator<Triple> it = g.find();
    +        try {
    +            while (it.hasNext()) {
    +                it.next();
    +                counter += 1;
    +                if (counter > size) {
    +                    return CMP_GREATER ;
    +                }
    +            }
    +            return counter == size ? CMP_EQUAL : CMP_LESS;
    +        } finally {
    +            it.close();
    +        }
    +    }
    +
    +    /** Compare the ration of graph sizes by co-iteration, up to some limit on the srcGraph size. */ 
    +    private static int compareSizeByFind(Graph dstGraph, Graph srcGraph, int ratio, int maxCompare) {
    +        ExtendedIterator<Triple> dstIter = dstGraph.find();
    +        ExtendedIterator<Triple> srcIter = srcGraph.find();
    +        try {
    +            // Step dstIter by "ratio" steps each time.
    +            return compareByLength(dstIter, srcIter, ratio, 1, maxCompare);
    +        } finally {
    +            dstIter.close();
    +            srcIter.close();
    +        }
    +    }
    +    
    +    /** Compare the length of two iterators.
    +     *  <b>This operation is destructive on the iterators<b>.
    +     */
    +    private static int compareByLength(Iterator<?> iter1, Iterator<?> iter2) {
    +        return compareByLength(iter1, iter2, 1, 1, Integer.MAX_VALUE);
    +    }
    +
    +    /** Compare the length of two iterators. By using the "step" argument, the app can compare rations.
    +     *  <b>This operation is destructive on the iterators<b>.
    +     */
    +    private static int compareByLength(Iterator<?> iter1, Iterator<?> iter2, int step1, int step2, int maxSteps) {
    +        if ( step1 <= 0 )
    +            throw new IllegalArgumentException("step size 'step1' is not positve: "+step1);
    +        if ( step2 <= 0 )
    +            throw new IllegalArgumentException("step size 'step2' is not positve: "+step2);
    +        
    +        int x = 0; 
    +        for ( ;; ) {
    +            x++ ;
    +            boolean ended1 = iter1.hasNext();
    +            boolean ended2 = iter2.hasNext();
    +            
    +            if ( ended1 )
    --- End diff --
    
    Shouldn't these be `! ended1` and `! ended2`?


---

[GitHub] jena issue #306: Algorithms for JENA-1414

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/306
  
    > @afs I'm not sure which algo is number 3?
    
    _2. Use the size of the src and iterate on the dst to compare sizes
    3. Iterator on both src and dst to compare sizes. Graph.size is not used at all._
    



---