You are viewing a plain text version of this content. The canonical link for it is here.
Posted to pr@jena.apache.org by GitBox <gi...@apache.org> on 2021/12/09 20:49:03 UTC

[GitHub] [jena] rvesse commented on a change in pull request #1127: Xloader (more)

rvesse commented on a change in pull request #1127:
URL: https://github.com/apache/jena/pull/1127#discussion_r766123451



##########
File path: apache-jena/bin/tdb2.xloader
##########
@@ -278,65 +287,94 @@ fi
 JAVA="${JAVA:-java}"
 
 info "Setup:"
-info "  Data:     $DATAFILES"
 info "  Database: $LOC"
-info "  Tmpdir:   $TMPDIR"
+info "  Data:     $DATAFILES"
+info "  TMPDIR:   $TMPDIR"
 
 # Large heap not required.
 JVM_ARGS="${JVM_ARGS:--Xmx2G}"
 
-## Time points.
+## Time point.
 
 TIME_START="$(now)"
 
-## Node table loading.
+## ======== Node table loading.
 if [ "$SYSTEM" == "TDB2" ]; then
+    ## TDB2 only.
+    info
     T="$(now)"
     info "Load node table"
     exec_java $PKG.CmdxBuildNodeTable --loc $LOC --tmpdir "$TMPDIR" $DATAFILES
     TIME_NODE_TABLE=$(($(now)-$T))
 fi
 
-## Ingest data, create workfiles
+## ======== Ingest data, creates workfiles
 info
 info "Ingest data"
 T="$(now)"
 exec_java $PKG.CmdxIngestData --loc $LOC --tmpdir "$TMPDIR" --triples "$TMPDIR/triples.tmp" --quads "$TMPDIR/quads.tmp" $DATAFILES
 TIME_INGEST=$(($(now)-$T))
 
-## @@ triples.tmp quads.tmp
+## ======== Indexes
+INFO="$TMPDIR/load.json"
+
+## Bash assocative array
+declare -A TIME_IDX
 
 function index() {
     local IDX="$1"
+    info
+    info "Build $IDX"
+    local T="$(now)"
     exec_java $PKG.CmdxBuildIndex --loc $LOC --tmpdir "$TMPDIR" --index $IDX \
 	      "$TMPDIR/triples.tmp" "$TMPDIR/quads.tmp"
+    local T_IDX=$(($(now)-$T))
+    TIME_IDX[$IDX]=$T_IDX
 }
 
-info
-info "Build SPO"
-T="$(now)"
-index SPO
-TIME_IDX_SPO=$(($(now)-$T))
+## Decide which indexes to generate.
+TRIPLES_DFT="SPO POS OSP"
+QUADS_DFT="GSPO GPOS GOSP SPOG POSG OSPG"
 
-info
-info "Build POS"
-T="$(now)"
-index POS
-TIME_IDX_POS=$(($(now)-$T))
+TRIPLES_IDX="${TRIPLES_IDX:-$TRIPLES_DFT}"
+QUADS_IDX="${QUADS_IDX:-$QUADS_DFT}"
 
-info
-info "Build OSP"
-T="$(now)"
-index OSP
-let TIME_IDX_OSP=$(($(now)-$T))
+if [ -e "$INFO" ] ; then
+    ## Skip a phase if there are no items to index.
+    TRIPLES="$(jq .triples < $INFO)"

Review comment:
       Do you want to verify that `jq` is actually available before trying to use it?
   
   While it's an increasingly common and popular tool it often isn't installed by default and in $dayjob that's often bitten us
   
   If it isn't available the subsequent `if` statements are liable to bomb out with Bash errors

##########
File path: jena-cmds/src/main/java/tdb2/xloader/AbstractCmdxLoad.java
##########
@@ -51,12 +51,17 @@
     protected static ArgDecl argTmpdir     = new ArgDecl(true, "tmpdir", "tmp");
     protected static ArgDecl argIndex      = new ArgDecl(true, "index");
 
-    protected static ArgDecl argSortArgs   = new ArgDecl(true, "sortArgs", "sortargs");
+    // If this is put back, note there are two different sorts - one for the ndoe table and several for the indexes.

Review comment:
       Typo `ndoe` -> `node`

##########
File path: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/xloader/ProcIndexBuildX.java
##########
@@ -143,31 +143,49 @@ private static long sort_build_index(Logger LOG, String datafile, DatasetGraph d
         Process proc2;
         OutputStream toSortOutputStream; // Not used. Input is a file.
         InputStream fromSortInputStream;
+
         try {
             //LOG.info("Step : external sort : "+indexName);
             //if ( sortArgs != null ) {}
+
+
             List<String> sortCmdBasics = Arrays.asList(
                  "sort",
                     "--temporary-directory="+TMPDIR, "--buffer-size=50%",
                     "--parallel=2", "--unique"
                     //, "--compress-program=/usr/bin/gzip"
-                );
+            );
             List<String> sortCmd = new ArrayList<>(sortCmdBasics);
+            if ( BulkLoaderX.CompressSortIndexFiles )
+                sortCmd.add("--compress-program=/usr/bin/gzip");

Review comment:
       Should this be configurable in case this is used on some system with a weird installation layout?  Or better still can this just be `gzip`?

##########
File path: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/xloader/ProcNodeTableBuilderX.java
##########
@@ -99,22 +104,27 @@
         Process proc2;
         try {
             //LOG.info("Step : external sort");
-            String[] sortCmd =
-                { "sort",
+            List<String> sortCmdBasics = Arrays.asList(
+                  "sort",
                     "--temporary-directory="+loaderFiles.TMPDIR, "--buffer-size=50%",
                     "--parallel=2", "--unique",

Review comment:
       Again `--parallel` not guaranteed to be available in all `sort`'s

##########
File path: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/xloader/ProcNodeTableBuilderX.java
##########
@@ -99,22 +104,27 @@
         Process proc2;
         try {
             //LOG.info("Step : external sort");
-            String[] sortCmd =
-                { "sort",
+            List<String> sortCmdBasics = Arrays.asList(
+                  "sort",
                     "--temporary-directory="+loaderFiles.TMPDIR, "--buffer-size=50%",
                     "--parallel=2", "--unique",
-                    // Don't compress. There will be enough space because later we work on indexes.
                     //"--compress-program=/usr/bin/gzip",
                     "--key=1,1"
-                };
-            if ( sortArgs != null ) {}
+                    );
+
+            List<String> sortCmd = new ArrayList<>(sortCmdBasics);
 
+            //if ( sortNodeTableArgs != null ) {}
+
+            // See javadoc for CompressSortNodeTableFiles - usually false
+            if ( BulkLoaderX.CompressSortNodeTableFiles )
+                sortCmd.add("--compress-program=/usr/bin/gzip");

Review comment:
       Again just use `gzip`?
   
   Also this should mean that the commented out line 111 can be deleted

##########
File path: jena-db/jena-tdb2/src/main/java/org/apache/jena/tdb2/xloader/ProcIndexBuildX.java
##########
@@ -143,31 +143,49 @@ private static long sort_build_index(Logger LOG, String datafile, DatasetGraph d
         Process proc2;
         OutputStream toSortOutputStream; // Not used. Input is a file.
         InputStream fromSortInputStream;
+
         try {
             //LOG.info("Step : external sort : "+indexName);
             //if ( sortArgs != null ) {}
+
+
             List<String> sortCmdBasics = Arrays.asList(
                  "sort",
                     "--temporary-directory="+TMPDIR, "--buffer-size=50%",
                     "--parallel=2", "--unique"

Review comment:
       Default `sort` on OS X doesn't have the `--parallel` option, should that be a configurable feature of the loader?
   
   (I know OS X systems are most likely to have SSDs and not need this loader so much)

##########
File path: apache-jena/bin/tdb2.xloader
##########
@@ -269,7 +278,7 @@ case "$SYSTEM" in
 	;;
 esac
 
-## Delete database!
+## Don't mess up an existsing database!

Review comment:
       Typo - `existsing` -> `existing`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@jena.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@jena.apache.org
For additional commands, e-mail: pr-help@jena.apache.org