You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kylin.apache.org by GitBox <gi...@apache.org> on 2021/02/10 16:22:49 UTC

[GitHub] [kylin] zhengshengjun opened a new pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

zhengshengjun opened a new pull request #1583:
URL: https://github.com/apache/kylin/pull/1583


   …building
   
   ## Proposed changes
   
   Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request. If it fixes a bug or resolves a feature request, be sure to link to that issue.
   
   ## Types of changes
   
   What types of changes does your code introduce to Kylin?
   _Put an `x` in the boxes that apply_
   
   - [ ] Bugfix (non-breaking change which fixes an issue)
   - [x] New feature (non-breaking change which adds functionality)
   - [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
   - [ ] Documentation Update (if none of the other choices apply)
   
   ## Checklist
   
   _Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code._
   
   - [x] I have create an issue on [Kylin's jira](https://issues.apache.org/jira/browse/KYLIN), and have described the bug/feature there in detail
   - [x] Commit messages in my PR start with the related jira ID, like "KYLIN-0000 Make Kylin project open-source"
   - [x] Compiling and unit tests pass locally with my changes
   - [ ] I have added tests that prove my fix is effective or that my feature works
   - [ ] If this change need a document change, I will prepare another pr against the `document` branch
   - [ ] Any dependent changes have been merged
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at user@kylin or dev@kylin by explaining why you chose the solution you did and what alternatives you considered, etc...
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r607687388



##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/BuildLayoutWithUpdate.java
##########
@@ -38,8 +46,45 @@
     private ExecutorService pool = Executors.newCachedThreadPool();
     private CompletionService<JobResult> completionService = new ExecutorCompletionService<>(pool);
     private int currentLayoutsNum = 0;
+    private Map<Long, AtomicLong> toBuildCuboidSize = new ConcurrentHashMap<>();
+    private Semaphore semaphore;
+    private Map<Long, Dataset<Row>> layout2DataSet = new ConcurrentHashMap<>();
+    private KylinConfig kylinConfig;
+    private boolean persistParentDataset;
+
+    public BuildLayoutWithUpdate(KylinConfig kylinConfig) {
+        this.kylinConfig = kylinConfig;
+        this.persistParentDataset = !kylinConfig.getParentDatasetStorageLevel().equals(StorageLevel.NONE());
+        if (this.persistParentDataset) {
+            this.semaphore = new Semaphore(kylinConfig.getMaxParentDatasetPersistCount());
+        }
+    }
+
+    public void cacheAndRegister(long layoutId, Dataset<Row> dataset) throws InterruptedException{
+        if (!persistParentDataset) {
+            return;
+        }
+        logger.info("persist dataset of layout: {}", layoutId);
+        semaphore.acquire();
+        layout2DataSet.put(layoutId, dataset);
+        dataset.persist(StorageLevel.fromString(kylinConfig.getParentDatasetStorageLevel()));

Review comment:
       Get the value of "kylinConfig.getParentDatasetStorageLevel()" in the BuildLayoutWithUpdate initialization method, and then check the validity of the value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r577378958



##########
File path: core-common/src/main/java/org/apache/kylin/common/KylinConfigBase.java
##########
@@ -3084,4 +3084,8 @@ public String getKerberosJaasConfPath() {
     public String getKerberosPrincipal() {
         return getOptional("kylin.kerberos.principal");
     }
+
+    public boolean isReuseParentDataSource() {
+        return Boolean.parseBoolean(getOptional("kylin.engine.spark.reuse.parent.rdd", FALSE));
+    }

Review comment:
       IMO, don't need this configuration, we can add a configuration named 'kylin.engine.spark.parent-dataset.storage.level' and set the storage level directly, if the storage level is 'NONE', it means it doesn't need to cache parent dataset. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zhengshengjun commented on pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zhengshengjun commented on pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#issuecomment-789025973


   thanks for @zzcclp's advice, i'll refine the implementation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r577377137



##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/BuildLayerWithUpdate.java
##########
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kylin.engine.spark.job;
+
+import org.apache.kylin.common.KylinConfig;
+import org.apache.kylin.engine.spark.builder.NBuildSourceInfo;
+import org.apache.kylin.engine.spark.metadata.SegmentInfo;
+import org.apache.kylin.engine.spark.metadata.cube.model.LayoutEntity;
+import org.apache.kylin.engine.spark.metadata.cube.model.SpanningTree;
+import org.apache.kylin.shaded.com.google.common.base.Preconditions;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Map;
+import java.util.concurrent.*;
+
+public class BuildLayerWithUpdate {
+
+    protected static final Logger logger = LoggerFactory.getLogger(BuildLayoutWithUpdate.class);
+    private ExecutorService pool = Executors.newCachedThreadPool();
+    private CompletionService<LayerJobResult> completionService = new ExecutorCompletionService<>(pool);
+    private int currentBuildInfoNum = 0;
+
+    public void submit(LayerJobEntity job) {
+        completionService.submit(new Callable<LayerJobResult>() {
+            @Override
+            public LayerJobResult call() throws Exception {
+                KylinConfig.setAndUnsetThreadLocalConfig(job.config);
+                Thread.currentThread().setName("thread-build-layer" + job.info.getLayoutId());
+                try {
+                    return job.build();
+                } catch (Throwable t) {
+                    logger.error("Error occurred when run " + job.info.getLayoutId(), t);
+                    return new LayerJobResult( t);
+                }
+            }
+        });
+        currentBuildInfoNum ++;
+    }
+
+    public void updateLayer() {
+        for (int i = 0; i < currentBuildInfoNum; i++) {
+            try {
+                logger.info("Wait to take layer result.");
+                LayerJobResult result = completionService.take().get();
+                logger.info("Take job layer successful.");
+                if (result.isFailed()) {
+                    shutDownPool();
+                    throw new RuntimeException(result.getThrowable());
+                }
+            } catch (InterruptedException | ExecutionException e) {
+                shutDownPool();
+                throw new RuntimeException(e);
+            }
+        }
+        currentBuildInfoNum = 0;
+    }
+
+    private void shutDownPool() {
+        pool.shutdown();
+        try {
+            pool.awaitTermination(10, TimeUnit.SECONDS);
+        } catch (InterruptedException e) {
+            logger.warn("Error occurred when shutdown thread pool.", e);
+            pool.shutdownNow();
+        }
+    }
+
+    private static class LayerJobResult {
+        private Throwable throwable;
+
+        LayerJobResult(Throwable throwable) {
+            this.throwable = throwable;
+        }
+
+        boolean isFailed() {
+            return throwable != null;
+        }
+
+        Throwable getThrowable() {
+            return throwable;
+        }
+
+    }
+
+    public static class LayerJobEntity {
+
+        NBuildSourceInfo info;
+        SpanningTree st;
+        SegmentInfo seg;
+        CubeBuildJob job;
+        KylinConfig config;
+        Map<Long, Long> cuboidsRowCount;
+
+        public LayerJobEntity(NBuildSourceInfo info, SpanningTree st, KylinConfig config,
+                              CubeBuildJob job, SegmentInfo seg, Map<Long, Long> cuboidsRowCount){
+            this.info = info;
+            this.config = config;
+            this.job = job;
+            this.st = st;
+            this.seg = seg;
+            this.cuboidsRowCount = cuboidsRowCount;
+        }
+
+        public LayerJobResult build() {
+            Collection<LayoutEntity> toBuildCuboids = info.getToBuildCuboids();
+            Dataset<Row> parentDS = info.getParentDS();
+            if (toBuildCuboids.size() > 1) {
+                parentDS.cache();

Review comment:
       How to set other storage level? It's better to add a configuration to support setting other storage level.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] lgtm-com[bot] commented on pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
lgtm-com[bot] commented on pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#issuecomment-790467593


   This pull request **introduces 1 alert** when merging b83f9b47b9421f11448ee714d98541ebb79d5bec into 01036afd898e313a7d3b31f326dc780fd47f8fd0 - [view on LGTM.com](https://lgtm.com/projects/g/apache/kylin/rev/pr-9fe6f8162d872e4244010ad277c672a8e8572881)
   
   **new alerts:**
   
   * 1 for Equals on incomparable types


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp merged pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp merged pull request #1583:
URL: https://github.com/apache/kylin/pull/1583


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zhengshengjun commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zhengshengjun commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r608301752



##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/CubeBuildJob.java
##########
@@ -330,6 +330,11 @@ private void build(Collection<NBuildSourceInfo> buildSourceInfos, SegmentInfo se
             cuboidsNumInLayer += toBuildCuboids.size();
             Preconditions.checkState(!toBuildCuboids.isEmpty(), "To be built cuboids is empty.");
             Dataset<Row> parentDS = info.getParentDS();
+
+            if (toBuildCuboids.size() >= 1) {

Review comment:
       you are right, fixed.

##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/BuildLayoutWithUpdate.java
##########
@@ -38,8 +46,45 @@
     private ExecutorService pool = Executors.newCachedThreadPool();
     private CompletionService<JobResult> completionService = new ExecutorCompletionService<>(pool);
     private int currentLayoutsNum = 0;
+    private Map<Long, AtomicLong> toBuildCuboidSize = new ConcurrentHashMap<>();
+    private Semaphore semaphore;
+    private Map<Long, Dataset<Row>> layout2DataSet = new ConcurrentHashMap<>();
+    private KylinConfig kylinConfig;
+    private boolean persistParentDataset;
+
+    public BuildLayoutWithUpdate(KylinConfig kylinConfig) {
+        this.kylinConfig = kylinConfig;
+        this.persistParentDataset = !kylinConfig.getParentDatasetStorageLevel().equals(StorageLevel.NONE());

Review comment:
       It's my fault ,thanks for finding it ~

##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/BuildLayoutWithUpdate.java
##########
@@ -38,8 +46,45 @@
     private ExecutorService pool = Executors.newCachedThreadPool();
     private CompletionService<JobResult> completionService = new ExecutorCompletionService<>(pool);
     private int currentLayoutsNum = 0;
+    private Map<Long, AtomicLong> toBuildCuboidSize = new ConcurrentHashMap<>();
+    private Semaphore semaphore;
+    private Map<Long, Dataset<Row>> layout2DataSet = new ConcurrentHashMap<>();
+    private KylinConfig kylinConfig;
+    private boolean persistParentDataset;
+
+    public BuildLayoutWithUpdate(KylinConfig kylinConfig) {
+        this.kylinConfig = kylinConfig;
+        this.persistParentDataset = !kylinConfig.getParentDatasetStorageLevel().equals(StorageLevel.NONE());
+        if (this.persistParentDataset) {
+            this.semaphore = new Semaphore(kylinConfig.getMaxParentDatasetPersistCount());
+        }
+    }
+
+    public void cacheAndRegister(long layoutId, Dataset<Row> dataset) throws InterruptedException{
+        if (!persistParentDataset) {
+            return;
+        }
+        logger.info("persist dataset of layout: {}", layoutId);
+        semaphore.acquire();
+        layout2DataSet.put(layoutId, dataset);
+        dataset.persist(StorageLevel.fromString(kylinConfig.getParentDatasetStorageLevel()));

Review comment:
       good suggestion 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r607641622



##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/BuildLayoutWithUpdate.java
##########
@@ -38,8 +46,45 @@
     private ExecutorService pool = Executors.newCachedThreadPool();
     private CompletionService<JobResult> completionService = new ExecutorCompletionService<>(pool);
     private int currentLayoutsNum = 0;
+    private Map<Long, AtomicLong> toBuildCuboidSize = new ConcurrentHashMap<>();
+    private Semaphore semaphore;
+    private Map<Long, Dataset<Row>> layout2DataSet = new ConcurrentHashMap<>();
+    private KylinConfig kylinConfig;
+    private boolean persistParentDataset;
+
+    public BuildLayoutWithUpdate(KylinConfig kylinConfig) {
+        this.kylinConfig = kylinConfig;
+        this.persistParentDataset = !kylinConfig.getParentDatasetStorageLevel().equals(StorageLevel.NONE());

Review comment:
       `kylinConfig.getParentDatasetStorageLevel().equals(StorageLevel.NONE())` will always return false , because 
    string can't equal to StorageLevel, it needs to use `StorageLevel.fromString(value).equals(StorageLevel.NONE())`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zhengshengjun commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zhengshengjun commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r577578007



##########
File path: core-common/src/main/java/org/apache/kylin/common/KylinConfigBase.java
##########
@@ -3084,4 +3084,8 @@ public String getKerberosJaasConfPath() {
     public String getKerberosPrincipal() {
         return getOptional("kylin.kerberos.principal");
     }
+
+    public boolean isReuseParentDataSource() {
+        return Boolean.parseBoolean(getOptional("kylin.engine.spark.reuse.parent.rdd", FALSE));
+    }

Review comment:
       thanks,  good advice indeed ~




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r577376293



##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/CubeBuildJob.java
##########
@@ -157,7 +157,12 @@ protected void doExecute() throws Exception {
                 logger.info("Triggered cube planner phase one .");
         }
 
-        buildLayoutWithUpdate = new BuildLayoutWithUpdate();
+        if (config.isReuseParentDataSource()) {
+            buildLayerWithUpdate = new BuildLayerWithUpdate();
+        } else {
+            buildLayoutWithUpdate = new BuildLayoutWithUpdate();

Review comment:
       Why needs these two similar Class? It's better to refactor the original class 'BuildLayoutWithUpdate' to support this feature.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zhengshengjun commented on pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zhengshengjun commented on pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#issuecomment-790445719


   @zzcclp updated, please help me to review this again  ^_^


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#issuecomment-816498039


   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r607759935



##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/CubeBuildJob.java
##########
@@ -330,6 +330,11 @@ private void build(Collection<NBuildSourceInfo> buildSourceInfos, SegmentInfo se
             cuboidsNumInLayer += toBuildCuboids.size();
             Preconditions.checkState(!toBuildCuboids.isEmpty(), "To be built cuboids is empty.");
             Dataset<Row> parentDS = info.getParentDS();
+
+            if (toBuildCuboids.size() >= 1) {

Review comment:
       why is '>='? if the size of toBuildCuboids is 1, it doesn't need to cache parent dataset?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r577379399



##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/CubeBuildJob.java
##########
@@ -319,45 +324,61 @@ private void build(Collection<NBuildSourceInfo> buildSourceInfos, SegmentInfo se
     // build current layer and return the next layer to be built.
     private List<NBuildSourceInfo> buildLayer(Collection<NBuildSourceInfo> buildSourceInfos, SegmentInfo seg,
                                               SpanningTree st) {
-        int cuboidsNumInLayer = 0;
 
-        // build current layer
+        int cuboidsNumInLayer = 0;
         List<LayoutEntity> allIndexesInCurrentLayer = new ArrayList<>();
+
+        //record build infos before building
         for (NBuildSourceInfo info : buildSourceInfos) {
             Collection<LayoutEntity> toBuildCuboids = info.getToBuildCuboids();
             infos.recordParent2Children(info.getLayout(),
                     toBuildCuboids.stream().map(LayoutEntity::getId).collect(Collectors.toList()));
             cuboidsNumInLayer += toBuildCuboids.size();
             Preconditions.checkState(!toBuildCuboids.isEmpty(), "To be built cuboids is empty.");
-            Dataset<Row> parentDS = info.getParentDS();
-            // record the source count of flat table
-            if (info.getLayoutId() == ParentSourceChooser.FLAT_TABLE_FLAG()) {
-                cuboidsRowCount.putIfAbsent(info.getLayoutId(), parentDS.count());
-            }
+            info.getToBuildCuboids().stream().forEach(allIndexesInCurrentLayer::add);
+            infos.recordCuboidsNumPerLayer(seg.id(), cuboidsNumInLayer);
+        }
 
-            for (LayoutEntity index : toBuildCuboids) {
-                Preconditions.checkNotNull(parentDS, "Parent dataset is null when building.");
-                buildLayoutWithUpdate.submit(new BuildLayoutWithUpdate.JobEntity() {
-                    @Override
-                    public String getName() {
-                        return "build-cuboid-" + index.getId();
-                    }
-
-                    @Override
-                    public LayoutEntity build() throws IOException {
-                        return buildCuboid(seg, index, parentDS, st, info.getLayoutId());
-                    }
-                }, config);
-                allIndexesInCurrentLayer.add(index);
+        if (config.isReuseParentDataSource()) {
+            logger.info("will cache parent dataset to build next layer");
+            for (NBuildSourceInfo info : buildSourceInfos) {
+                buildLayerWithUpdate.submit(new BuildLayerWithUpdate.LayerJobEntity(info,
+                        st, config, this, seg, cuboidsRowCount));
             }
+            //wait until finished and remove LayoutEntity from toBuildLayouts
+            buildLayerWithUpdate.updateLayer();
+        } else {
+            logger.info("will not cache parent dataset to build next layer");
+            // build current layer
+            for (NBuildSourceInfo info : buildSourceInfos) {
+                Collection<LayoutEntity> toBuildCuboids = info.getToBuildCuboids();
+                Dataset<Row> parentDS = info.getParentDS();
+                // record the source count of flat table
+                if (info.getLayoutId() == ParentSourceChooser.FLAT_TABLE_FLAG()) {
+                    cuboidsRowCount.putIfAbsent(info.getLayoutId(), parentDS.count());
+                }
+                for (LayoutEntity index : toBuildCuboids) {
+                    Preconditions.checkNotNull(parentDS, "Parent dataset is null when building.");
+                    buildLayoutWithUpdate.submit(new BuildLayoutWithUpdate.JobEntity() {
+                        @Override
+                        public String getName() {
+                            return "build-cuboid-" + index.getId();
+                        }
+
+                        @Override
+                        public LayoutEntity build() throws IOException {
+                            return buildCuboid(seg, index, parentDS, st, info.getLayoutId());
+                        }
+                    }, config);
+                }
+            }
+            buildLayoutWithUpdate.updateLayout(seg, config);
         }
 
-        infos.recordCuboidsNumPerLayer(seg.id(), cuboidsNumInLayer);
-        buildLayoutWithUpdate.updateLayout(seg, config);
-
         // decided the next layer by current layer's all indexes.
         st.decideTheNextLayer(allIndexesInCurrentLayer, seg);
         return constructTheNextLayerBuildInfos(st, seg, allIndexesInCurrentLayer);
+

Review comment:
       remove this blank line.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kylin] zzcclp commented on a change in pull request #1583: KYLIN-4903 cache parent datasource to accelerate next layer's cuboid building

Posted by GitBox <gi...@apache.org>.
zzcclp commented on a change in pull request #1583:
URL: https://github.com/apache/kylin/pull/1583#discussion_r580960574



##########
File path: kylin-spark-project/kylin-spark-engine/src/main/scala/org/apache/kylin/engine/spark/job/CubeBuildJob.java
##########
@@ -319,45 +324,61 @@ private void build(Collection<NBuildSourceInfo> buildSourceInfos, SegmentInfo se
     // build current layer and return the next layer to be built.
     private List<NBuildSourceInfo> buildLayer(Collection<NBuildSourceInfo> buildSourceInfos, SegmentInfo seg,
                                               SpanningTree st) {
-        int cuboidsNumInLayer = 0;
 
-        // build current layer
+        int cuboidsNumInLayer = 0;
         List<LayoutEntity> allIndexesInCurrentLayer = new ArrayList<>();
+
+        //record build infos before building
         for (NBuildSourceInfo info : buildSourceInfos) {
             Collection<LayoutEntity> toBuildCuboids = info.getToBuildCuboids();
             infos.recordParent2Children(info.getLayout(),
                     toBuildCuboids.stream().map(LayoutEntity::getId).collect(Collectors.toList()));
             cuboidsNumInLayer += toBuildCuboids.size();
             Preconditions.checkState(!toBuildCuboids.isEmpty(), "To be built cuboids is empty.");
-            Dataset<Row> parentDS = info.getParentDS();
-            // record the source count of flat table
-            if (info.getLayoutId() == ParentSourceChooser.FLAT_TABLE_FLAG()) {
-                cuboidsRowCount.putIfAbsent(info.getLayoutId(), parentDS.count());
-            }
+            info.getToBuildCuboids().stream().forEach(allIndexesInCurrentLayer::add);
+            infos.recordCuboidsNumPerLayer(seg.id(), cuboidsNumInLayer);

Review comment:
       the value of 'cuboidsNumInLayer' is the total count of current segments, why needs to set this value into infos per loop




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org