Posted to dev@jena.apache.org by afs <gi...@git.apache.org> on 2018/06/02 20:00:29 UTC

[GitHub] jena pull request #426: JENA-1552: Phased loader

GitHub user afs opened a pull request:

    https://github.com/apache/jena/pull/426

    JENA-1552: Phased loader

    This has become a general-purpose loader (`LoaderMain`) which is controlled by a `LoaderPlan`. `LoaderMain` is both the parallel loader and the phased loader of JENA-1552.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afs/jena phased-loader

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/jena/pull/426.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #426
    
----
commit 3b712638db710630882844fd4c266120fa2a49ca
Author: Andy Seaborne <an...@...>
Date:   2018-06-01T13:07:55Z

    JENA-1552: Phased loader

----


---

[GitHub] jena pull request #426: JENA-1552: Phased loader

Posted by rvesse <gi...@git.apache.org>.
Github user rvesse commented on a diff in the pull request:

    https://github.com/apache/jena/pull/426#discussion_r192680996
  
    --- Diff: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/loader/parallel/LoaderPlan.java ---
    @@ -0,0 +1,57 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.jena.tdb2.loader.parallel;
    +
    +import org.apache.jena.tdb2.store.NodeId;
    +
    +/** 
    + * A {@code LoaderPlan}
    + * <p>
    + * For triples and for quads there is a first phase to parse the input, 
    + * convert to tuples of {@link NodeId NodeIds}, including allocating the ids,
    + * and do at least one tuple index for each of triples quads to capture the input.
    + * <p>   
    + * After that, a number of phases builds the other indexes. 
    + * <p>
    + * The {@code mulithreadedInput} f;ag indicates whether the first phase is
    + * done in parallel (threads for parer, node table building and primary indexes)
    + * or as a single threaded process.
    + */
    +public class LoaderPlan {
    +    private final InputStage dataInput;
    +    private final String[] loadGroup3;
    +    private final String[] loadGroup4;
    +    private final String[][] secondaryGroups3;
    +    private final String[][] secondaryGroups4;
    +    
    +    public LoaderPlan(InputStage dataInput,
    +                String[] loadGroup3, String[] loadGroup4,
    +                String[][] secondaryGroups3, String[][] secondaryGroups4) {
    +        this.dataInput = dataInput;
    +        this.loadGroup3 = loadGroup3;
    +        this.loadGroup4 = loadGroup4;
    +        this.secondaryGroups3 = secondaryGroups3;
    +        this.secondaryGroups4 = secondaryGroups4;
    +    }
    +    public InputStage dataInputType()   { return dataInput; }
    +    public String[] primaryLoad3()          { return loadGroup3; }
    +    public String[] primaryLoad4()          { return loadGroup4; }
    +    public String[][] secondaryIndex3()     { return secondaryGroups3; }
    +    public String[][] secondaryIndex4()     { return secondaryGroups4; }
    --- End diff --
    
    The naming of these seems somewhat esoteric. I assume that the 3 and 4 refer to triples and quads respectively?
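
If the 3 and 4 do denote tuple width, the relationship between an index name and a tuple can be sketched as below. This is only an illustration: `IndexNames` is a hypothetical helper, not part of Jena, plain `long` values stand in for `NodeId`s, and it assumes the `String[]` entries in the plan are index names such as `SPO` or `GSPO`, which this thread does not show.

```java
/** Hypothetical helper, not part of Jena: relates index names to tuple width. */
final class IndexNames {
    private IndexNames() {}

    /** Width implied by an index name: 3 letters for triples, 4 for quads. */
    static int tupleWidth(String indexName) {
        return indexName.length();
    }

    /**
     * Reorder a tuple given in natural order ("SPO" or "GSPO") into the
     * column order named by indexName, e.g. "POS" or "POSG".
     */
    static long[] reorder(long[] tuple, String naturalOrder, String indexName) {
        if (tuple.length != naturalOrder.length() || tuple.length != indexName.length())
            throw new IllegalArgumentException("Tuple width does not match index name");
        long[] out = new long[tuple.length];
        for (int i = 0; i < indexName.length(); i++) {
            // Find which natural-order slot supplies column i of the index.
            int src = naturalOrder.indexOf(indexName.charAt(i));
            if (src < 0)
                throw new IllegalArgumentException("Unknown column: " + indexName.charAt(i));
            out[i] = tuple[src];
        }
        return out;
    }
}
```

Under that assumption, a name of length 3 can only be applied to a triple and a name of length 4 only to a quad, which is the distinction the two groups of fields appear to encode.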


---

[GitHub] jena issue #426: JENA-1552: Phased loader

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/426
  
    All technical work is finished for this iteration. I'll do some renaming ("parallel" is still used as a general name whereas it is now a profile of "LoaderMain"), and then this should be good to go.


---

[GitHub] jena issue #426: JENA-1552: Phased loader

Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:

    https://github.com/apache/jena/pull/426
  
    Long term (!), I'd like to get some aspects of TDB1 `tdbloader2` into TDB2, but in Java for portability. `tdbloader2` breaks up the task of building a single index into machine-resource-sized chunks. `LoaderMain` does not consider multiple tasks to build one index; this also means `LoaderPlan`s are not completely stable as an API.
    
    In doing work on loading, I can see more possibilities which need experimentation to see if they are a good idea or a mad idea. This PR is breaking the "wait until perfect" effect. This loader should be good for 100s of millions of triples (I haven't tested a billion yet), so it is a significant step forward in itself.
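
One common way to build a single index in resource-sized chunks is an external sort: sort each chunk that fits in memory, then merge the sorted runs. The sketch below shows that idea in plain Java. It is an assumption about what a Java port might look like, not code from this PR or from `tdbloader2`; `ChunkedSort` is a hypothetical name, `long` keys stand in for index entries, and the runs are kept in memory where a real loader would spill them to disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of chunked sorting; not part of Jena. */
final class ChunkedSort {
    private ChunkedSort() {}

    /** Split the keys into chunks of at most chunkSize and sort each chunk. */
    static List<long[]> sortedRuns(long[] keys, int chunkSize) {
        if (chunkSize <= 0)
            throw new IllegalArgumentException("chunkSize must be positive");
        List<long[]> runs = new ArrayList<>();
        for (int start = 0; start < keys.length; start += chunkSize) {
            int end = Math.min(keys.length, start + chunkSize);
            long[] run = Arrays.copyOfRange(keys, start, end);
            Arrays.sort(run);
            runs.add(run);
        }
        return runs;
    }

    /** K-way merge of sorted runs into one sorted array. */
    static long[] merge(List<long[]> runs) {
        // Heap entries are {value, run index, offset within run}, ordered by value.
        PriorityQueue<long[]> heap =
            new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[0]));
        int total = 0;
        for (int r = 0; r < runs.size(); r++) {
            long[] run = runs.get(r);
            total += run.length;
            if (run.length > 0)
                heap.add(new long[] { run[0], r, 0 });
        }
        long[] out = new long[total];
        int n = 0;
        while (!heap.isEmpty()) {
            long[] e = heap.poll();
            out[n++] = e[0];
            long[] run = runs.get((int) e[1]);
            int next = (int) e[2] + 1;
            if (next < run.length)
                heap.add(new long[] { run[next], e[1], next });
        }
        return out;
    }
}
```

Calling `ChunkedSort.merge(ChunkedSort.sortedRuns(keys, chunkSize))` yields the keys in sorted order while only ever sorting `chunkSize` keys at a time.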


---

[GitHub] jena pull request #426: JENA-1552: Phased loader

Posted by rvesse <gi...@git.apache.org>.
Github user rvesse commented on a diff in the pull request:

    https://github.com/apache/jena/pull/426#discussion_r192681609
  
    --- Diff: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/loader/parallel/LoaderPlan.java ---
    @@ -0,0 +1,57 @@
    +
    +package org.apache.jena.tdb2.loader.parallel;
    +
    +import org.apache.jena.tdb2.store.NodeId;
    +
    +/** 
    + * A {@code LoaderPlan}
    + * <p>
    + * For triples and for quads there is a first phase to parse the input, 
    + * convert to tuples of {@link NodeId NodeIds}, including allocating the ids,
    + * and do at least one tuple index for each of triples quads to capture the input.
    + * <p>   
    + * After that, a number of phases builds the other indexes. 
    + * <p>
    + * The {@code mulithreadedInput} f;ag indicates whether the first phase is
    + * done in parallel (threads for parer, node table building and primary indexes)
    + * or as a single threaded process.
    + */
    +public class LoaderPlan {
    --- End diff --
    
    I really like this design. Since the class has a public constructor, end users could potentially test out alternative strategies if they so desired?
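
As an illustration of what trying out alternative strategies might look like, the sketch below builds two plans with the constructor shape shown in the diff. Everything here is a stand-in: the nested `LoaderPlan` only mirrors the quoted class, the `InputStage` constants are hypothetical because the real ones are not shown in this thread, and the index names and groupings are assumptions chosen for the example.

```java
/** Sketch only: stand-ins for the types in the diff, not the real Jena classes. */
final class PlanExamples {
    private PlanExamples() {}

    /** Hypothetical constants; the real InputStage values are not shown in this thread. */
    enum InputStage { MULTI_THREADED, SINGLE_THREADED }

    /** Same shape as the LoaderPlan quoted in the diff. */
    static final class LoaderPlan {
        private final InputStage dataInput;
        private final String[] loadGroup3;
        private final String[] loadGroup4;
        private final String[][] secondaryGroups3;
        private final String[][] secondaryGroups4;

        LoaderPlan(InputStage dataInput,
                   String[] loadGroup3, String[] loadGroup4,
                   String[][] secondaryGroups3, String[][] secondaryGroups4) {
            this.dataInput = dataInput;
            this.loadGroup3 = loadGroup3;
            this.loadGroup4 = loadGroup4;
            this.secondaryGroups3 = secondaryGroups3;
            this.secondaryGroups4 = secondaryGroups4;
        }
        InputStage dataInputType()   { return dataInput; }
        String[] primaryLoad3()      { return loadGroup3; }
        String[] primaryLoad4()      { return loadGroup4; }
        String[][] secondaryIndex3() { return secondaryGroups3; }
        String[][] secondaryIndex4() { return secondaryGroups4; }
    }

    /** Strategy A (assumed): build every index during the first phase. */
    static LoaderPlan allInFirstPhase() {
        return new LoaderPlan(InputStage.MULTI_THREADED,
            new String[] { "SPO", "POS", "OSP" },
            new String[] { "GSPO", "GPOS", "GOSP", "POSG", "OSPG", "SPOG" },
            new String[][] {},
            new String[][] {});
    }

    /** Strategy B (assumed): one primary index each, the rest in later phases. */
    static LoaderPlan phased() {
        return new LoaderPlan(InputStage.SINGLE_THREADED,
            new String[] { "SPO" },
            new String[] { "GSPO" },
            new String[][] { { "POS" }, { "OSP" } },
            new String[][] { { "GPOS", "GOSP" }, { "POSG", "OSPG", "SPOG" } });
    }
}
```

Each inner array in the secondary groups is read here as one later phase, so `phased()` would build the remaining triple indexes in two further phases. That reading follows the class Javadoc but is not confirmed by the thread.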


---

[GitHub] jena pull request #426: JENA-1552: Phased loader

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/jena/pull/426


---

[GitHub] jena pull request #426: JENA-1552: Phased loader

Posted by rvesse <gi...@git.apache.org>.
Github user rvesse commented on a diff in the pull request:

    https://github.com/apache/jena/pull/426#discussion_r192680143
  
    --- Diff: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/loader/parallel/LoaderPlan.java ---
    @@ -0,0 +1,57 @@
    +
    +package org.apache.jena.tdb2.loader.parallel;
    +
    +import org.apache.jena.tdb2.store.NodeId;
    +
    +/** 
    + * A {@code LoaderPlan}
    + * <p>
    + * For triples and for quads there is a first phase to parse the input, 
    + * convert to tuples of {@link NodeId NodeIds}, including allocating the ids,
    + * and do at least one tuple index for each of triples quads to capture the input.
    + * <p>   
    + * After that, a number of phases builds the other indexes. 
    + * <p>
    + * The {@code mulithreadedInput} f;ag indicates whether the first phase is
    --- End diff --
    
    Typo `f;ag` -> `flag`


---