You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jackrabbit.apache.org by GitBox <gi...@apache.org> on 2022/09/21 10:03:58 UTC

[GitHub] [jackrabbit-oak] amit-jain opened a new pull request, #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

amit-jain opened a new pull request, #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715

   - Parallel indexing enabled by splitting the FlatFileStore taking into account aggregation and then running multiple threads to do the indexing
   - LZ4 compression enabled
   
   Work over https://github.com/apache/jackrabbit-oak/pull/587


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] joerghoh commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
joerghoh commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r996866172


##########
oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/progress/IndexingProgressReporter.java:
##########
@@ -229,10 +231,12 @@ public IndexUpdateState(String indexPath, boolean reindex, long estimatedCount)
         }
 
         public void indexUpdate() throws CommitFailedException {
-            updateCount++;
-            if (updateCount % 10000 == 0) {
-                log.info("{} => Indexed {} nodes in {} ...", indexPath, updateCount, watch);
-                watch.reset().start();
+            long count = updateCount.incrementAndGet();
+            if (count % 10000 == 0) {
+                synchronized (this) {

Review Comment:
   Is there a need to synchronize this? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] nit0906 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
nit0906 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982189273


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -70,6 +73,10 @@ public static IndexWriterConfig getIndexWriterConfig(LuceneIndexDefinition defin
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {
                 config.setMergeScheduler(new SerialMergeScheduler());
+            } else {

Review Comment:
   ah - ok. I think it would be difficult to determine here whether this is being called from out-of-band indexing or redular async indexing. So I think the best we can do is to set the default to 1 and add a comment that the default value should be changed only after thorough analysis. 



##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -70,6 +73,10 @@ public static IndexWriterConfig getIndexWriterConfig(LuceneIndexDefinition defin
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {
                 config.setMergeScheduler(new SerialMergeScheduler());
+            } else {

Review Comment:
   ah - ok. I think it would be difficult to determine here whether this is being called from out-of-band indexing or regular async indexing. So I think the best we can do is to set the default to 1 and add a comment that the default value should be changed only after thorough analysis. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] nit0906 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
nit0906 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982132294


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -70,6 +73,10 @@ public static IndexWriterConfig getIndexWriterConfig(LuceneIndexDefinition defin
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {
                 config.setMergeScheduler(new SerialMergeScheduler());
+            } else {

Review Comment:
   > The default is 1 thread so essentially the concurrent will default to serial ofcourse if the property is not overriden
   
   How's the default 1 ? The max threads are hardcoded to 8 as of now - or did I miss something ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#issuecomment-1280427905

   Alternate [changes](https://github.com/amit-jain/jackrabbit-oak/commit/292fd08b1d282f47707464e633e2b3b57719bf85) to remove dependency on the LZR for other modules and only introduce it for oak-run-commons/oak-run 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r976324312


##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/DocumentStoreIndexerBase.java:
##########
@@ -265,6 +271,55 @@ public void reindex() throws CommitFailedException, IOException {
         indexerSupport.postIndexWork(copyOnWriteStore);
     }
 
+    private void indexParallel(List<FlatFileStore> storeList, CompositeIndexer indexer, IndexingProgressReporter progressReporter) {
+        ExecutorService service = Executors.newFixedThreadPool(IndexerConfiguration.indexThreadPoolSize());
+        List<Future> futureList = new ArrayList<>();
+
+        for (FlatFileStore item : storeList) {
+            Future future = service.submit(new Callable<Boolean>() {
+                @Override
+                public Boolean call() throws IOException, CommitFailedException {
+                    for (NodeStateEntry entry : item) {
+                        reportDocumentRead(entry.getPath(), progressReporter);
+                        log.trace("Indexing : {}", entry.getPath());
+                        indexer.index(entry);
+                    }
+                    return true;
+                }
+            });
+            futureList.add(future);
+        }
+
+        try {
+            for (Future future : futureList) {
+                future.get();
+            }
+            log.info("All {} indexing jobs are done", storeList.size());
+        } catch (InterruptedException | ExecutionException e) {
+            log.error("Failure getting indexing job result", e);

Review Comment:
   @fabriziofortino You had a comment in the original PR which I forgot to address. It might be   better to throw in case of exceptions as it affects correctness. @thomasmueller wdyt?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982115197


##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticBulkProcessorHandler.java:
##########
@@ -151,6 +153,7 @@ private BulkProcessor initBulkProcessor() {
         return BulkProcessor.builder(requestConsumer(),
                 new OakBulkProcessorListener(), this.indexName + "-bulk-processor")
                 .setBulkActions(indexDefinition.bulkActions)
+                .setConcurrentRequests(BULK_PROCESSOR_CONCURRENCY)

Review Comment:
   This config does not seem suitable to be included in the IndexDefinition as it is affecting the processing in general and not per index. @nit0906 wdyt?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] fabriziofortino commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
fabriziofortino commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r990166677


##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/DocumentStoreIndexerBase.java:
##########
@@ -265,6 +271,55 @@ public void reindex() throws CommitFailedException, IOException {
         indexerSupport.postIndexWork(copyOnWriteStore);
     }
 
+    private void indexParallel(List<FlatFileStore> storeList, CompositeIndexer indexer, IndexingProgressReporter progressReporter) {
+        ExecutorService service = Executors.newFixedThreadPool(IndexerConfiguration.indexThreadPoolSize());
+        List<Future> futureList = new ArrayList<>();
+
+        for (FlatFileStore item : storeList) {
+            Future future = service.submit(new Callable<Boolean>() {
+                @Override
+                public Boolean call() throws IOException, CommitFailedException {
+                    for (NodeStateEntry entry : item) {
+                        reportDocumentRead(entry.getPath(), progressReporter);
+                        log.trace("Indexing : {}", entry.getPath());
+                        indexer.index(entry);
+                    }
+                    return true;
+                }
+            });
+            futureList.add(future);
+        }
+
+        try {
+            for (Future future : futureList) {
+                future.get();
+            }
+            log.info("All {} indexing jobs are done", storeList.size());
+        } catch (InterruptedException | ExecutionException e) {
+            log.error("Failure getting indexing job result", e);

Review Comment:
   I would propagate the exception and fail the indexing job.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] nit0906 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
nit0906 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r981911157


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -70,6 +73,10 @@ public static IndexWriterConfig getIndexWriterConfig(LuceneIndexDefinition defin
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {
                 config.setMergeScheduler(new SerialMergeScheduler());
+            } else {

Review Comment:
   one reason why remoteDir could be true is because copyOnWrite is disabled (which usually isn't the case), which would imply we would most probably run into the else block here even during normal indexing.
   
   So effectively the new code change here (the else block) will impact normal indexing as well (not just the out of the band indexing from oak-run), from what I can tell - was that taken into account ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982092548


##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileSplitter.java:
##########
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.jackrabbit.oak.index.indexer.document.flatfile;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.jackrabbit.oak.api.PropertyState;
+import org.apache.jackrabbit.oak.api.Type;
+import org.apache.jackrabbit.oak.commons.Compression;
+import org.apache.jackrabbit.oak.index.indexer.document.IndexerConfiguration;
+import org.apache.jackrabbit.oak.index.indexer.document.NodeStateEntry;
+import org.apache.jackrabbit.oak.plugins.index.search.Aggregate;
+import org.apache.jackrabbit.oak.plugins.index.search.IndexDefinition;
+import org.apache.jackrabbit.oak.query.ast.NodeTypeInfo;
+import org.apache.jackrabbit.oak.query.ast.NodeTypeInfoProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.Stack;
+import java.util.stream.Collectors;
+
+import static org.apache.jackrabbit.JcrConstants.JCR_PRIMARYTYPE;
+import static org.apache.jackrabbit.JcrConstants.NT_BASE;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileNodeStoreBuilder.OAK_INDEXER_USE_LZ4;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileNodeStoreBuilder.OAK_INDEXER_USE_ZIP;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.createReader;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.createWriter;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.getSortedStoreFileName;
+
+/**
+ * This class is being used when {@value IndexerConfiguration#PROP_OAK_INDEXER_PARALLEL_INDEX} is set to true.
+ * It will split a flat file safely by checking the index definitions. An entry is considered safe to split if only
+ * none of the parent directories contains nodes in indexRule and aggregate fields of the provided index definitions.
+ */

Review Comment:
   Examples are available as part of the tests, not sure it makes sense to list them here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] gnaresh19 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
gnaresh19 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r978985478


##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticBulkProcessorHandler.java:
##########
@@ -151,6 +153,7 @@ private BulkProcessor initBulkProcessor() {
         return BulkProcessor.builder(requestConsumer(),
                 new OakBulkProcessorListener(), this.indexName + "-bulk-processor")
                 .setBulkActions(indexDefinition.bulkActions)
+                .setConcurrentRequests(BULK_PROCESSOR_CONCURRENCY)

Review Comment:
   @amit-jain : Can the concurrentRequests property be moved to index definition, similar to other properties like bulkSize, which would enable the caller to configure appropriate concurrentRequests in elastic index definition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] nit0906 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
nit0906 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r981910827


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -70,6 +73,10 @@ public static IndexWriterConfig getIndexWriterConfig(LuceneIndexDefinition defin
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {

Review Comment:
   one reason why remoteDir could be true is because copyOnWrite is disabled (which usually isn't the case).
   
   So effectively the new code change here (the else block) will impact normal indexing as well (not just the out of the band indexing from oak-run), from what I can tell - was that taken into account ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982093897


##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileSplitter.java:
##########
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.jackrabbit.oak.index.indexer.document.flatfile;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.jackrabbit.oak.api.PropertyState;
+import org.apache.jackrabbit.oak.api.Type;
+import org.apache.jackrabbit.oak.commons.Compression;
+import org.apache.jackrabbit.oak.index.indexer.document.IndexerConfiguration;
+import org.apache.jackrabbit.oak.index.indexer.document.NodeStateEntry;
+import org.apache.jackrabbit.oak.plugins.index.search.Aggregate;
+import org.apache.jackrabbit.oak.plugins.index.search.IndexDefinition;
+import org.apache.jackrabbit.oak.query.ast.NodeTypeInfo;
+import org.apache.jackrabbit.oak.query.ast.NodeTypeInfoProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.Stack;
+import java.util.stream.Collectors;
+
+import static org.apache.jackrabbit.JcrConstants.JCR_PRIMARYTYPE;
+import static org.apache.jackrabbit.JcrConstants.NT_BASE;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileNodeStoreBuilder.OAK_INDEXER_USE_LZ4;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileNodeStoreBuilder.OAK_INDEXER_USE_ZIP;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.createReader;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.createWriter;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.getSortedStoreFileName;
+
+/**
+ * This class is being used when {@value IndexerConfiguration#PROP_OAK_INDEXER_PARALLEL_INDEX} is set to true.
+ * It will split a flat file safely by checking the index definitions. An entry is considered safe to split if only
+ * none of the parent directories contains nodes in indexRule and aggregate fields of the provided index definitions.
+ */
+public class FlatFileSplitter {
+    private static final Logger LOG = LoggerFactory.getLogger(FlatFileSplitter.class);
+
+    private static final String SPLIT_DIR_NAME = "split";
+
+    private final File workDir;
+    private final NodeTypeInfoProvider infoProvider;
+    private final File flatFile;
+    private final NodeStateEntryReader entryReader;
+    private final Compression algorithm;
+    private final Set<IndexDefinition> indexDefinitions;
+    private Set<String> splitNodeTypeNames;
+    private boolean useCompression = Boolean.parseBoolean(System.getProperty(OAK_INDEXER_USE_ZIP, "true"));
+    private boolean useLZ4 = Boolean.parseBoolean(System.getProperty(OAK_INDEXER_USE_LZ4, "false"));
+
+    public FlatFileSplitter(File flatFile, File workdir, NodeTypeInfoProvider infoProvider, NodeStateEntryReader entryReader,
+            Set<IndexDefinition> indexDefinitions) {
+        this.flatFile = flatFile;
+        this.indexDefinitions = indexDefinitions;
+        this.workDir = new File(workdir, SPLIT_DIR_NAME);
+
+        this.infoProvider = infoProvider;
+        this.entryReader = entryReader;
+
+        Compression algorithm = Compression.GZIP;
+        if (!useCompression) {
+            algorithm = Compression.NONE;
+        } else if (useLZ4) {
+            algorithm = Compression.LZ4;
+        }
+        this.algorithm = algorithm;
+    }
+
+    private List<File> returnOriginalFlatFile() {
+        return Collections.singletonList(flatFile);
+    }
+
+    public List<File> split() throws IOException {
+        return split(true);
+    }
+
+    public List<File> split(boolean deleteOriginal) throws IOException {
+        List<File> splitFlatFiles = new ArrayList<>();
+        try {
+            FileUtils.forceMkdir(workDir);
+        } catch (IOException e) {
+            LOG.error("failed to create split directory {}", workDir.getAbsolutePath());
+            return returnOriginalFlatFile();
+        }
+
+        long fileSizeInBytes = flatFile.length();
+        long splitThreshold = Math.round((double) (fileSizeInBytes / IndexerConfiguration.splitSize()));
+        LOG.info("original flat file size: ~{}", FileUtils.byteCountToDisplaySize(fileSizeInBytes));
+        LOG.info("split threshold is ~{} bytes, estimate split size >={} files", FileUtils.byteCountToDisplaySize(splitThreshold), IndexerConfiguration.splitSize());
+
+        // return original if file too small or split size equals 1
+        if (splitThreshold < IndexerConfiguration.minSplitThreshold() || IndexerConfiguration.splitSize() <= 1) {
+            LOG.info("split is not necessary, skip splitting");
+            return returnOriginalFlatFile();
+        }
+
+        Set<String>splitNodeTypesName = getSplitNodeTypeNames();
+        LOG.info("unsafe split types: {}", splitNodeTypesName);
+        if (splitNodeTypesName.contains(NT_BASE)) {
+            LOG.info("Skipping split because split node types set contains {}", NT_BASE);
+            return returnOriginalFlatFile();
+        }
+
+        try (BufferedReader reader = createReader(flatFile, algorithm)) {
+            long readPos = 0;
+            int outFileIndex = 1;
+            File currentFile = new File(workDir, "split-" + outFileIndex + "-" + getSortedStoreFileName(algorithm));
+            BufferedWriter writer = createWriter(currentFile, algorithm);
+            splitFlatFiles.add(currentFile);
+
+            String line;
+            int lineCount = 0;
+            Stack<String> nodeTypeNameStack = new Stack<>();
+            while ((line = reader.readLine()) != null) {
+                updateNodeTypeStack(nodeTypeNameStack, line);
+                boolean shouldSplit = (readPos > splitThreshold);
+                if (shouldSplit && canSplit(splitNodeTypesName, nodeTypeNameStack)) {
+                    writer.close();
+                    LOG.info("created split flat file {} with size {}", currentFile.getAbsolutePath(), FileUtils.byteCountToDisplaySize(currentFile.length()));
+                    readPos = 0;
+                    outFileIndex++;
+                    currentFile = new File(workDir, "split-" + outFileIndex + "-" + getSortedStoreFileName(algorithm));
+                    writer = createWriter(currentFile, algorithm);
+                    splitFlatFiles.add(currentFile);
+                    LOG.info("split position found at line {}, creating new split file {}", lineCount, currentFile.getAbsolutePath());
+                }
+                writer.append(line);
+                writer.newLine();
+                readPos += line.length() + 1;
+                lineCount++;
+            }
+            writer.close();
+            LOG.info("created split flat file {} with size {}", currentFile.getAbsolutePath(), FileUtils.byteCountToDisplaySize(currentFile.length()));
+
+            LOG.info("split total line count: {}", lineCount);
+        }
+
+        if (deleteOriginal) {
+            LOG.info("removing original flat file {} after splitting into {} files", flatFile.getAbsolutePath(), splitFlatFiles);
+            flatFile.delete();
+        }
+
+        return splitFlatFiles;
+    }
+
+    private void updateNodeTypeStack(Stack<String> parentNodeTypeNames, String line) {

Review Comment:
   Possibly we can, but these steps should mostly be covered in the FlatFileSplitterTest. Do you think we are missing tests here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r996797821


##########
oak-commons/pom.xml:
##########
@@ -58,6 +58,9 @@
               org.apache.jackrabbit.oak.commons.sort,
               org.apache.jackrabbit.oak.commons.properties
             </Export-Package>
+            <Embed-Dependency>
+              lz4-java;scope=compile|runtime
+            </Embed-Dependency>

Review Comment:
   > This is a big change, and I think we need to be careful to not break anything. I think we will need more reviewers.
   
   This is a PR to incorporate changes from review of https://github.com/apache/jackrabbit-oak/pull/587. So, its been reviewed by many reviewers already



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982096471


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -70,6 +73,10 @@ public static IndexWriterConfig getIndexWriterConfig(LuceneIndexDefinition defin
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {
                 config.setMergeScheduler(new SerialMergeScheduler());
+            } else {

Review Comment:
   I am not sure. Can you suggest what change would be advisable here to ccount for that. The default is 1 thread so essentially the concurrent will default to serial ofcourse if the property is not overriden



##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -33,12 +35,13 @@
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
 import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
+import org.apache.lucene.index.ConcurrentMergeScheduler;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.index.SerialMergeScheduler;
 
-import static org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexConstants.VERSION;
-
 public class IndexWriterUtils {
+    private static final int INDEX_WRITER_MAX_MERGE = 8;

Review Comment:
   Yes, makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982145887


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -70,6 +73,10 @@ public static IndexWriterConfig getIndexWriterConfig(LuceneIndexDefinition defin
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {
                 config.setMergeScheduler(new SerialMergeScheduler());
+            } else {

Review Comment:
   As you suggested the properties should be configurable and if that's the case then the default should be 1. But question is what can be done if in future this default changes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] nit0906 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
nit0906 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r981908519


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -33,12 +35,13 @@
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
 import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
+import org.apache.lucene.index.ConcurrentMergeScheduler;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.index.SerialMergeScheduler;
 
-import static org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexConstants.VERSION;
-
 public class IndexWriterUtils {
+    private static final int INDEX_WRITER_MAX_MERGE = 8;

Review Comment:
   this should be configurable imo



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain merged pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain merged PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982096471


##########
oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/IndexWriterUtils.java:
##########
@@ -70,6 +73,10 @@ public static IndexWriterConfig getIndexWriterConfig(LuceneIndexDefinition defin
             IndexWriterConfig config = new IndexWriterConfig(VERSION, analyzer);
             if (remoteDir) {
                 config.setMergeScheduler(new SerialMergeScheduler());
+            } else {

Review Comment:
   I am not sure. Can you suggest what change would be advisable here to account for that. The default is 1 thread so essentially the concurrent will default to serial ofcourse if the property is not overriden



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] nit0906 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
nit0906 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r982429383


##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticBulkProcessorHandler.java:
##########
@@ -151,6 +153,7 @@ private BulkProcessor initBulkProcessor() {
         return BulkProcessor.builder(requestConsumer(),
                 new OakBulkProcessorListener(), this.indexName + "-bulk-processor")
                 .setBulkActions(indexDefinition.bulkActions)
+                .setConcurrentRequests(BULK_PROCESSOR_CONCURRENCY)

Review Comment:
   @amit-jain  - agreed.
   
   the bulkProcessor instance is actually linked to index definition and all other configurations like bulk size threshold , flush interval etc are also defined/configured at index definition level. So if we want to keep configurations in the same line - then it would seem the concurrency can also go in the index definition. 
   
   But this is a different case where we would want to distinguish this on the basis of whether indexing is running from oak-run or otherwise - so configuring this in index definition won't be possible (since the index def would remain the same in both cases but we would want the concurrency to be different - so a system property is the only solution here.) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] thomasmueller commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
thomasmueller commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r996787925


##########
oak-commons/pom.xml:
##########
@@ -58,6 +58,9 @@
               org.apache.jackrabbit.oak.commons.sort,
               org.apache.jackrabbit.oak.commons.properties
             </Export-Package>
+            <Embed-Dependency>
+              lz4-java;scope=compile|runtime
+            </Embed-Dependency>

Review Comment:
   For this change, I think we should ask others in the Oak team... @reschke , @mreutegg , @anchela 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r996908837


##########
oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/progress/IndexingProgressReporter.java:
##########
@@ -229,10 +231,12 @@ public IndexUpdateState(String indexPath, boolean reindex, long estimatedCount)
         }
 
         public void indexUpdate() throws CommitFailedException {
-            updateCount++;
-            if (updateCount % 10000 == 0) {
-                log.info("{} => Indexed {} nodes in {} ...", indexPath, updateCount, watch);
-                watch.reset().start();
+            long count = updateCount.incrementAndGet();
+            if (count % 10000 == 0) {
+                synchronized (this) {

Review Comment:
   The Stopwatch class is not thread-safe so, potentially could result in instability otherwise



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] nit0906 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
nit0906 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r978311689


##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileSplitter.java:
##########
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.jackrabbit.oak.index.indexer.document.flatfile;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.jackrabbit.oak.api.PropertyState;
+import org.apache.jackrabbit.oak.api.Type;
+import org.apache.jackrabbit.oak.commons.Compression;
+import org.apache.jackrabbit.oak.index.indexer.document.IndexerConfiguration;
+import org.apache.jackrabbit.oak.index.indexer.document.NodeStateEntry;
+import org.apache.jackrabbit.oak.plugins.index.search.Aggregate;
+import org.apache.jackrabbit.oak.plugins.index.search.IndexDefinition;
+import org.apache.jackrabbit.oak.query.ast.NodeTypeInfo;
+import org.apache.jackrabbit.oak.query.ast.NodeTypeInfoProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.Stack;
+import java.util.stream.Collectors;
+
+import static org.apache.jackrabbit.JcrConstants.JCR_PRIMARYTYPE;
+import static org.apache.jackrabbit.JcrConstants.NT_BASE;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileNodeStoreBuilder.OAK_INDEXER_USE_LZ4;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileNodeStoreBuilder.OAK_INDEXER_USE_ZIP;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.createReader;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.createWriter;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.getSortedStoreFileName;
+
+/**
+ * This class is being used when {@value IndexerConfiguration#PROP_OAK_INDEXER_PARALLEL_INDEX} is set to true.
+ * It will split a flat file safely by checking the index definitions. An entry is considered safe to split if only
+ * none of the parent directories contains nodes in indexRule and aggregate fields of the provided index definitions.
+ */
+public class FlatFileSplitter {
+    private static final Logger LOG = LoggerFactory.getLogger(FlatFileSplitter.class);
+
+    private static final String SPLIT_DIR_NAME = "split";
+
+    private final File workDir;
+    private final NodeTypeInfoProvider infoProvider;
+    private final File flatFile;
+    private final NodeStateEntryReader entryReader;
+    private final Compression algorithm;
+    private final Set<IndexDefinition> indexDefinitions;
+    private Set<String> splitNodeTypeNames;
+    private boolean useCompression = Boolean.parseBoolean(System.getProperty(OAK_INDEXER_USE_ZIP, "true"));
+    private boolean useLZ4 = Boolean.parseBoolean(System.getProperty(OAK_INDEXER_USE_LZ4, "false"));
+
+    public FlatFileSplitter(File flatFile, File workdir, NodeTypeInfoProvider infoProvider, NodeStateEntryReader entryReader,
+            Set<IndexDefinition> indexDefinitions) {
+        this.flatFile = flatFile;
+        this.indexDefinitions = indexDefinitions;
+        this.workDir = new File(workdir, SPLIT_DIR_NAME);
+
+        this.infoProvider = infoProvider;
+        this.entryReader = entryReader;
+
+        Compression algorithm = Compression.GZIP;
+        if (!useCompression) {
+            algorithm = Compression.NONE;
+        } else if (useLZ4) {
+            algorithm = Compression.LZ4;
+        }
+        this.algorithm = algorithm;
+    }
+
+    private List<File> returnOriginalFlatFile() {
+        return Collections.singletonList(flatFile);
+    }
+
+    public List<File> split() throws IOException {
+        return split(true);
+    }
+
+    public List<File> split(boolean deleteOriginal) throws IOException {
+        List<File> splitFlatFiles = new ArrayList<>();
+        try {
+            FileUtils.forceMkdir(workDir);
+        } catch (IOException e) {
+            LOG.error("failed to create split directory {}", workDir.getAbsolutePath());
+            return returnOriginalFlatFile();
+        }
+
+        long fileSizeInBytes = flatFile.length();
+        long splitThreshold = Math.round((double) (fileSizeInBytes / IndexerConfiguration.splitSize()));
+        LOG.info("original flat file size: ~{}", FileUtils.byteCountToDisplaySize(fileSizeInBytes));
+        LOG.info("split threshold is ~{} bytes, estimate split size >={} files", FileUtils.byteCountToDisplaySize(splitThreshold), IndexerConfiguration.splitSize());
+
+        // return original if file too small or split size equals 1
+        if (splitThreshold < IndexerConfiguration.minSplitThreshold() || IndexerConfiguration.splitSize() <= 1) {
+            LOG.info("split is not necessary, skip splitting");
+            return returnOriginalFlatFile();
+        }
+
+        Set<String>splitNodeTypesName = getSplitNodeTypeNames();
+        LOG.info("unsafe split types: {}", splitNodeTypesName);
+        if (splitNodeTypesName.contains(NT_BASE)) {
+            LOG.info("Skipping split because split node types set contains {}", NT_BASE);
+            return returnOriginalFlatFile();
+        }
+
+        try (BufferedReader reader = createReader(flatFile, algorithm)) {
+            long readPos = 0;
+            int outFileIndex = 1;
+            File currentFile = new File(workDir, "split-" + outFileIndex + "-" + getSortedStoreFileName(algorithm));
+            BufferedWriter writer = createWriter(currentFile, algorithm);
+            splitFlatFiles.add(currentFile);
+
+            String line;
+            int lineCount = 0;
+            Stack<String> nodeTypeNameStack = new Stack<>();
+            while ((line = reader.readLine()) != null) {
+                updateNodeTypeStack(nodeTypeNameStack, line);
+                boolean shouldSplit = (readPos > splitThreshold);
+                if (shouldSplit && canSplit(splitNodeTypesName, nodeTypeNameStack)) {
+                    writer.close();
+                    LOG.info("created split flat file {} with size {}", currentFile.getAbsolutePath(), FileUtils.byteCountToDisplaySize(currentFile.length()));
+                    readPos = 0;
+                    outFileIndex++;
+                    currentFile = new File(workDir, "split-" + outFileIndex + "-" + getSortedStoreFileName(algorithm));
+                    writer = createWriter(currentFile, algorithm);
+                    splitFlatFiles.add(currentFile);
+                    LOG.info("split position found at line {}, creating new split file {}", lineCount, currentFile.getAbsolutePath());
+                }
+                writer.append(line);
+                writer.newLine();
+                readPos += line.length() + 1;
+                lineCount++;
+            }
+            writer.close();
+            LOG.info("created split flat file {} with size {}", currentFile.getAbsolutePath(), FileUtils.byteCountToDisplaySize(currentFile.length()));
+
+            LOG.info("split total line count: {}", lineCount);
+        }
+
+        if (deleteOriginal) {
+            LOG.info("removing original flat file {} after splitting into {} files", flatFile.getAbsolutePath(), splitFlatFiles);
+            flatFile.delete();
+        }
+
+        return splitFlatFiles;
+    }
+
+    private void updateNodeTypeStack(Stack<String> parentNodeTypeNames, String line) {

Review Comment:
   There are a few util methods in this class like this one - I wonder if it might make sense to make these static to be able to easily write test cases for these ? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] nit0906 commented on a diff in pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
nit0906 commented on code in PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#discussion_r978288503


##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileSplitter.java:
##########
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.jackrabbit.oak.index.indexer.document.flatfile;
+
+import org.apache.commons.io.FileUtils;
+import org.apache.jackrabbit.oak.api.PropertyState;
+import org.apache.jackrabbit.oak.api.Type;
+import org.apache.jackrabbit.oak.commons.Compression;
+import org.apache.jackrabbit.oak.index.indexer.document.IndexerConfiguration;
+import org.apache.jackrabbit.oak.index.indexer.document.NodeStateEntry;
+import org.apache.jackrabbit.oak.plugins.index.search.Aggregate;
+import org.apache.jackrabbit.oak.plugins.index.search.IndexDefinition;
+import org.apache.jackrabbit.oak.query.ast.NodeTypeInfo;
+import org.apache.jackrabbit.oak.query.ast.NodeTypeInfoProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.Stack;
+import java.util.stream.Collectors;
+
+import static org.apache.jackrabbit.JcrConstants.JCR_PRIMARYTYPE;
+import static org.apache.jackrabbit.JcrConstants.NT_BASE;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileNodeStoreBuilder.OAK_INDEXER_USE_LZ4;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileNodeStoreBuilder.OAK_INDEXER_USE_ZIP;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.createReader;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.createWriter;
+import static org.apache.jackrabbit.oak.index.indexer.document.flatfile.FlatFileStoreUtils.getSortedStoreFileName;
+
+/**
+ * This class is being used when {@value IndexerConfiguration#PROP_OAK_INDEXER_PARALLEL_INDEX} is set to true.
+ * It will split a flat file safely by checking the index definitions. An entry is considered safe to split if only
+ * none of the parent directories contains nodes in indexRule and aggregate fields of the provided index definitions.
+ */

Review Comment:
   An example would be beneficial here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [jackrabbit-oak] amit-jain commented on pull request #715: OAK-9790 - Implement parallel indexing for speeding up oak run indexing command

Posted by GitBox <gi...@apache.org>.
amit-jain commented on PR #715:
URL: https://github.com/apache/jackrabbit-oak/pull/715#issuecomment-1281846587

   LZ4 support moved to issue OAK-9968 & commit https://github.com/amit-jain/jackrabbit-oak/commit/e8b4a0617fc857f281d6dec967cd48f7e5a16450 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@jackrabbit.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org