Posted to dev@jena.apache.org by afs <gi...@git.apache.org> on 2018/06/02 20:00:29 UTC
[GitHub] jena pull request #426: JENA-1552: Phased loader
GitHub user afs opened a pull request:
https://github.com/apache/jena/pull/426
JENA-1552: Phased loader
This has become a general purpose loader (`LoaderMain`) which is controlled by a `LoaderPlan`. LoaderMain is both the parallel loader and the phased loader of JENA-1552.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/afs/jena phased-loader
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/jena/pull/426.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #426
----
commit 3b712638db710630882844fd4c266120fa2a49ca
Author: Andy Seaborne <an...@...>
Date: 2018-06-01T13:07:55Z
JENA-1552: Phased loader
----
---
[GitHub] jena pull request #426: JENA-1552: Phased loader
Posted by rvesse <gi...@git.apache.org>.
Github user rvesse commented on a diff in the pull request:
https://github.com/apache/jena/pull/426#discussion_r192680996
--- Diff: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/loader/parallel/LoaderPlan.java ---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.tdb2.loader.parallel;
+
+import org.apache.jena.tdb2.store.NodeId;
+
+/**
+ * A {@code LoaderPlan}
+ * <p>
+ * For triples and for quads there is a first phase to parse the input,
+ * convert to tuples of {@link NodeId NodeIds}, including allocating the ids,
+ * and do at least one tuple index for each of triples quads to capture the input.
+ * <p>
+ * After that, a number of phases builds the other indexes.
+ * <p>
+ * The {@code mulithreadedInput} f;ag indicates whether the first phase is
+ * done in parallel (threads for parer, node table building and primary indexes)
+ * or as a single threaded process.
+ */
+public class LoaderPlan {
+ private final InputStage dataInput;
+ private final String[] loadGroup3;
+ private final String[] loadGroup4;
+ private final String[][] secondaryGroups3;
+ private final String[][] secondaryGroups4;
+
+ public LoaderPlan(InputStage dataInput,
+ String[] loadGroup3, String[] loadGroup4,
+ String[][] secondaryGroups3, String[][] secondaryGroups4) {
+ this.dataInput = dataInput;
+ this.loadGroup3 = loadGroup3;
+ this.loadGroup4 = loadGroup4;
+ this.secondaryGroups3 = secondaryGroups3;
+ this.secondaryGroups4 = secondaryGroups4;
+ }
+ public InputStage dataInputType() { return dataInput; }
+ public String[] primaryLoad3() { return loadGroup3; }
+ public String[] primaryLoad4() { return loadGroup4; }
+ public String[][] secondaryIndex3() { return secondaryGroups3; }
+ public String[][] secondaryIndex4() { return secondaryGroups4; }
--- End diff --
The naming of these seems somewhat esoteric; I assume that the 3 and 4 refer to triples and quads respectively?
---
[GitHub] jena issue #426: JENA-1552: Phased loader
Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:
https://github.com/apache/jena/pull/426
All technical work is finished for this iteration. I'll do some renaming ("parallel" is still used as a general name whereas it is now a profile of "LoaderMain") then this should be good to go.
---
[GitHub] jena issue #426: JENA-1552: Phased loader
Posted by afs <gi...@git.apache.org>.
Github user afs commented on the issue:
https://github.com/apache/jena/pull/426
Long term (!), I'd like to get some aspects of TDB1 `tdbloader2` into TDB2, but in Java for portability. `tdbloader2` breaks up the task of building a single index into machine-resource-sized chunks. `LoaderMain` does not consider multiple tasks to build one index; this also means `LoaderPlan`s are not completely stable as an API.
In doing work on loading, I can see more possibilities which need experimentation to see if they are a good idea or a mad idea. This PR is breaking the "wait until perfect" effect. This loader should be good for 100s of millions of triples (I haven't tested a billion yet), so it is a significant step forward in itself.
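As a rough illustration of building one index from machine-resource-sized chunks, the sketch below sorts fixed-size chunks of id tuples independently and then merges them into one ordered stream. This is the generic sort-then-merge technique, not code from `tdbloader2` or `LoaderMain`; the class name `ChunkedIndexSketch`, its methods, and the use of `long[]` as a stand-in for a tuple of node ids are all assumptions made for the example, and the chunks are kept in memory here where a real loader would presumably spill them to disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ChunkedIndexSketch {
    // Lexicographic order on equal-length tuples of ids (hypothetical stand-in for NodeId tuples).
    public static int compareTuples(long[] a, long[] b) {
        for (int i = 0; i < a.length; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    // Split the input into chunks of at most chunkSize tuples and sort each chunk on its own.
    public static List<List<long[]>> sortedChunks(List<long[]> tuples, int chunkSize) {
        List<List<long[]>> chunks = new ArrayList<>();
        for (int start = 0; start < tuples.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, tuples.size());
            List<long[]> chunk = new ArrayList<>(tuples.subList(start, end));
            chunk.sort(ChunkedIndexSketch::compareTuples);
            chunks.add(chunk);
        }
        return chunks;
    }

    // K-way merge of the sorted chunks into a single sorted list.
    public static List<long[]> merge(List<List<long[]>> chunks) {
        // Heap entries are {chunk number, position in chunk}, ordered by the tuple they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (x, y) -> compareTuples(chunks.get(x[0]).get(x[1]), chunks.get(y[0]).get(y[1])));
        for (int c = 0; c < chunks.size(); c++) {
            if (!chunks.get(c).isEmpty()) heap.add(new int[] {c, 0});
        }
        List<long[]> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<long[]> chunk = chunks.get(head[0]);
            out.add(chunk.get(head[1]));
            // Advance within the chunk the smallest tuple came from.
            if (head[1] + 1 < chunk.size()) heap.add(new int[] {head[0], head[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> tuples = new ArrayList<>();
        tuples.add(new long[] {3, 1, 2});
        tuples.add(new long[] {1, 2, 3});
        tuples.add(new long[] {2, 2, 2});
        tuples.add(new long[] {1, 1, 9});
        tuples.add(new long[] {3, 0, 0});
        for (long[] t : merge(sortedChunks(tuples, 2))) {
            System.out.println(t[0] + " " + t[1] + " " + t[2]);
        }
    }
}
```

The chunk size is the knob that would be matched to available memory; the merge step only ever holds one cursor per chunk.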
---
[GitHub] jena pull request #426: JENA-1552: Phased loader
Posted by rvesse <gi...@git.apache.org>.
Github user rvesse commented on a diff in the pull request:
https://github.com/apache/jena/pull/426#discussion_r192681609
--- Diff: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/loader/parallel/LoaderPlan.java ---
@@ -0,0 +1,57 @@
+package org.apache.jena.tdb2.loader.parallel;
+
+import org.apache.jena.tdb2.store.NodeId;
+
+/**
+ * A {@code LoaderPlan}
+ * <p>
+ * For triples and for quads there is a first phase to parse the input,
+ * convert to tuples of {@link NodeId NodeIds}, including allocating the ids,
+ * and do at least one tuple index for each of triples quads to capture the input.
+ * <p>
+ * After that, a number of phases builds the other indexes.
+ * <p>
+ * The {@code mulithreadedInput} f;ag indicates whether the first phase is
+ * done in parallel (threads for parer, node table building and primary indexes)
+ * or as a single threaded process.
+ */
+public class LoaderPlan {
--- End diff --
I really like this design. Since the class has a public constructor, end users could potentially test out alternative strategies if they so desired?
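To make the "alternative strategies" idea concrete, here is a minimal, self-contained sketch that mirrors the constructor and accessors quoted in the diff and builds one plan by hand. It does not use the Jena classes: `LoaderPlanSketch`, the `InputStage` constants, `examplePlan` and `describe` are hypothetical names for this example, and the index names (`SPO`, `GSPO`, and so on) and their grouping are only an illustrative choice, not a recommended or built-in plan.

```java
import java.util.ArrayList;
import java.util.List;

public class LoaderPlanSketch {
    // Hypothetical stand-in: the real InputStage constants are not shown in this thread.
    public enum InputStage { MULTI_THREADED, SINGLE_THREADED }

    // Same shape as the LoaderPlan constructor and accessors in the quoted diff.
    public static final class LoaderPlan {
        private final InputStage dataInput;
        private final String[] loadGroup3;
        private final String[] loadGroup4;
        private final String[][] secondaryGroups3;
        private final String[][] secondaryGroups4;

        public LoaderPlan(InputStage dataInput,
                          String[] loadGroup3, String[] loadGroup4,
                          String[][] secondaryGroups3, String[][] secondaryGroups4) {
            this.dataInput = dataInput;
            this.loadGroup3 = loadGroup3;
            this.loadGroup4 = loadGroup4;
            this.secondaryGroups3 = secondaryGroups3;
            this.secondaryGroups4 = secondaryGroups4;
        }
        public InputStage dataInputType() { return dataInput; }
        public String[] primaryLoad3() { return loadGroup3; }
        public String[] primaryLoad4() { return loadGroup4; }
        public String[][] secondaryIndex3() { return secondaryGroups3; }
        public String[][] secondaryIndex4() { return secondaryGroups4; }
    }

    // An illustrative plan: one primary index each for triples (3) and quads (4),
    // then secondary indexes built in later phases, one phase per inner array.
    public static LoaderPlan examplePlan() {
        return new LoaderPlan(
            InputStage.MULTI_THREADED,
            new String[] {"SPO"},
            new String[] {"GSPO"},
            new String[][] { {"POS"}, {"OSP"} },
            new String[][] { {"GPOS", "GOSP"}, {"SPOG", "POSG", "OSPG"} });
    }

    // Lists the phases a plan implies, in the order the Javadoc describes:
    // the input phase with the primary indexes first, then each secondary group.
    public static List<String> describe(LoaderPlan plan) {
        List<String> phases = new ArrayList<>();
        int n = 1;
        phases.add("phase " + n++ + " [input, " + plan.dataInputType() + "]: triples "
            + String.join(",", plan.primaryLoad3()) + "; quads "
            + String.join(",", plan.primaryLoad4()));
        for (String[] group : plan.secondaryIndex3()) {
            phases.add("phase " + n++ + " [triples]: " + String.join(",", group));
        }
        for (String[] group : plan.secondaryIndex4()) {
            phases.add("phase " + n++ + " [quads]: " + String.join(",", group));
        }
        return phases;
    }

    public static void main(String[] args) {
        describe(examplePlan()).forEach(System.out::println);
    }
}
```

Trying a different strategy then amounts to changing how the index names are grouped into inner arrays, for example building all secondary triple indexes in one phase instead of two.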
---
[GitHub] jena pull request #426: JENA-1552: Phased loader
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/jena/pull/426
---
[GitHub] jena pull request #426: JENA-1552: Phased loader
Posted by rvesse <gi...@git.apache.org>.
Github user rvesse commented on a diff in the pull request:
https://github.com/apache/jena/pull/426#discussion_r192680143
--- Diff: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/loader/parallel/LoaderPlan.java ---
@@ -0,0 +1,57 @@
+package org.apache.jena.tdb2.loader.parallel;
+
+import org.apache.jena.tdb2.store.NodeId;
+
+/**
+ * A {@code LoaderPlan}
+ * <p>
+ * For triples and for quads there is a first phase to parse the input,
+ * convert to tuples of {@link NodeId NodeIds}, including allocating the ids,
+ * and do at least one tuple index for each of triples quads to capture the input.
+ * <p>
+ * After that, a number of phases builds the other indexes.
+ * <p>
+ * The {@code mulithreadedInput} f;ag indicates whether the first phase is
--- End diff --
Typo `f;ag` -> `flag`
---