You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@carbondata.apache.org by GitBox <gi...@apache.org> on 2020/10/15 13:24:33 UTC
[GitHub] [carbondata] shenjiayu17 opened a new pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
shenjiayu17 opened a new pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986
### Why is this PR needed?
The horizontal compaction flow will be too slow when updating with lots of segments(or lots of blocks), so we try to analyze and optimize it for time-consuming problem.
### What changes were proposed in this PR?
In performDeleteDeltaCompaction, optimize the method getSegListIUDCompactionQualified. Combine two traversals of segments which have same process, and move listFiles process outside the traversal of blocks.
### Does this PR introduce any user interface change?
- No
### Is any new testcase added?
- No
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509202313
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,38 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
final long deltaStartTimestamp =
getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
final long deltaEndTimeStamp =
getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ Set<String> deleteDeltaFiles = new HashSet<>();
+ deleteDeltaFiles.add(segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN + deltaStartTimestamp +
+ CarbonCommonConstants.DELETE_DELTA_FILE_EXT);
+ deleteDeltaFiles.add(segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN + deltaEndTimeStamp +
+ CarbonCommonConstants.DELETE_DELTA_FILE_EXT);
Review comment:
why are you adding two times here with start and end timestamp? This is wrong, please check
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r507834166
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,66 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files mapped to each block of the specified segment.
+ * First list all deletedelta files in the segment dir, then loop the files and find
+ * a map of blocks and .deletedelta files related to each block.
+ *
+ * @param seg the segment which is to find blocks
+ * @return a map of block and its file list
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
- String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ public Map<String, List<CarbonFile>> getDeleteDeltaFilesList(final Segment seg) {
+
+ Map<String, long[]> blockDeltaStartAndEndTimestampMap = new HashMap<>();
Review comment:
@shenjiayu17 why exactly do we need to change here? Introduce map and all, still we are doing the list files which is costly operation. I have some points, please check
1. Here no need to create these map, again listing files to fill map.
we are already getting the blockname and every blovck will have one corresponding deletedelta file only right. So always the delta files per block block will be one.
2. The updatedetails contains the blockname, actual blockname, timestamps and delete delta timestamps so you can check if the timestamp not empty, you can yourself form the delta file name based on these info and return from this method.
With the above approach, you will avoid list files operation, filtering operation based on timestamp and creating all these maps. So you can avoid all these changes.
Always we keep the horizontal compaction threshold as 1, and we dont change and dont recommend users to change to get the better performance.
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -173,6 +176,9 @@ object HorizontalCompaction {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+
+ LOG.info(s"Horizontal Delete Compaction operation is getting valid segments for [$db.$table].")
Review comment:
same as above
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -125,6 +125,9 @@ object HorizontalCompaction {
segLists: util.List[Segment]): Unit = {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+
+ LOG.info(s"Horizontal Update Compaction operation is getting valid segments for [$db.$table].")
Review comment:
i think this log does not give any useful info here, if you put some log after line 133 and print `validSegList ` it looks little useful
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717024597
We tested with 10 segments total 10G and update more than 20 times to see the cost, update row count of each time is 176973.
The result and comparison before and after improving is shown below, we can see that time-consuming of UPDATE reduces by about 30% after improving horizontal compaction.
| | Time Before Improving | Time after Improving |
| ------- | ------------- | ------------- |
| Average time <br> from 2nd to 21st UPDATE | **99.19** | **67.55** |
| UPDATE | | |
| 1st | 38.397 | 71.91 |
| 2nd | 74.918 | 67.386 |
| 3rd | 43.851 | 46.171 |
| 4th | 52.133 | 82.974 |
| 5th | 86.021 | 44.499 |
| 6th | 62.992 | 39.973 |
| 7th | 99.077 | 69.292 |
| 8th | 77.815 | 60.416 |
| 9th | 74.949 | 50.187 |
| 10th | 85.874 | 50.973 |
| 11th | 103.902 | 58.005 |
| 12th | 77.534 | 86.087 |
| 13th | 79.154 | 76.283 |
| 14th | 116.117 | 72.029 |
| 15th | 112.846 | 74.182 |
| 16th | 124.707 | 69.282 |
| 17th | 124.677 | 67.546 |
| 18th | 147.15 | 66.652 |
| 19th | 135.127 | 109.385 |
| 20th | 133.94 | 80.849 |
| 21st | 171.112 | 78.92 |
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r507380031
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,8 +1209,22 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
- segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
+ // only when SegmentUpdateDetails contain the specified block
+ // will the method getDeleteDeltaFilesList be executed
+ List<String> blockNameList = segmentUpdateStatusManager.getBlockNameFromSegment(seg);
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = new HashMap<>();
+ CarbonFile[] deleteDeltaFiles = null;
+ if (blockNameList.contains(blockName)) {
+ blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg));
+ }
+ if (blockAndDeleteDeltaFilesMap.containsKey(blockName)) {
+ List<CarbonFile> deleteDeltaFileList = blockAndDeleteDeltaFilesMap.get(blockName);
+ deleteDeltaFiles = deleteDeltaFileList.toArray(new CarbonFile[deleteDeltaFileList.size()]);
+ }
+
+ // CarbonFile[] deleteDeltaFiles =
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r508943092
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,66 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files mapped to each block of the specified segment.
+ * First list all deletedelta files in the segment dir, then loop the files and find
+ * a map of blocks and .deletedelta files related to each block.
+ *
+ * @param seg the segment which is to find blocks
+ * @return a map of block and its file list
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
- String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ public Map<String, List<CarbonFile>> getDeleteDeltaFilesList(final Segment seg) {
+
+ Map<String, long[]> blockDeltaStartAndEndTimestampMap = new HashMap<>();
Review comment:
I modified this method as above approach.
The delete delta file name is generated from updatedetails, without listing files
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509979972
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -130,6 +130,7 @@ object HorizontalCompaction {
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ LOG.debug(s"The segment list for Horizontal Update Compaction is ${ validSegList }")
Review comment:
remove brackets, which are unnecessary
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717880461
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4716/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510001821
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
- // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
- // blocks as one. Currently DeleteDeltaFiles array contains Delete Delta Block name which
- // lies within Delete Delta Start TimeStamp and End TimeStamp. In order to eliminate
- // Spill Over Blocks will choose files with unique taskID.
- for (CarbonFile blocks : deleteDeltaFiles) {
- // Get Task ID and the Timestamp from the Block name for e.g.
- // part-0-3-1481084721319.carbondata => "3-1481084721319"
- String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
- String timestamp =
- CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
- String taskAndTimeStamp = task + "-" + timestamp;
+ if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
+ for (String file : deleteDeltaFiles) {
Review comment:
```suggestion
for (String deleteDeltaFile: deleteDeltaFiles) {
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509980034
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -177,6 +178,7 @@ object HorizontalCompaction {
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ LOG.debug(s"The segment list for Horizontal Update Compaction is ${ deletedBlocksList }")
Review comment:
remove brackets, which are unnecessary
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709984578
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4484/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509221753
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1039,22 +1039,24 @@ private static boolean isSegmentValid(LoadMetadataDetails seg) {
if (CompactionType.IUD_DELETE_DELTA == compactionTypeIUD) {
int numberDeleteDeltaFilesThreshold =
CarbonProperties.getInstance().getNoDeleteDeltaFilesThresholdForIUDCompaction();
- List<Segment> deleteSegments = new ArrayList<>();
- for (Segment seg : segments) {
- if (checkDeleteDeltaFilesInSeg(seg, segmentUpdateStatusManager,
- numberDeleteDeltaFilesThreshold)) {
- deleteSegments.add(seg);
+
+ // firstly find the valid segments which are updated from SegmentUpdateDetails,
+ // in order to reduce the segment list for behind traversal
+ List<String> segmentsPresentInSegmentUpdateDetails = new ArrayList<>();
Review comment:
you dont need to delete the old code, its already going inside logic if the `updateDetails `are present, So you can just put the `getDeleteDeltaFilesInSeg` method inside `checkDeleteDeltaFilesInSeg` and no need to completely modify `getSegListIUDCompactionQualified`, it will be more clean.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-713264301
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4555/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509943740
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,38 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
final long deltaStartTimestamp =
getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
final long deltaEndTimeStamp =
getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ Set<String> deleteDeltaFiles = new HashSet<>();
+ deleteDeltaFiles.add(segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN + deltaStartTimestamp +
+ CarbonCommonConstants.DELETE_DELTA_FILE_EXT);
+ deleteDeltaFiles.add(segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN + deltaEndTimeStamp +
+ CarbonCommonConstants.DELETE_DELTA_FILE_EXT);
Review comment:
I modified this method by calling `getDeltaFileStamps ` as the comment below
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-712853350
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2785/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] marchpure commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
marchpure commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r505670979
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
return blockLists;
}
+ private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg,
+ SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+
+ List<String> blockLists = new ArrayList<>();
+
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg);
+
+ List<String> blockNameList =
+ segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
+
+ Set<String> uniqueBlocks = new HashSet<String>();
+ for (final String blockName : blockNameList) {
+
+ List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName);
+
+ if (null != deleteDeltaFiles) {
+ for (CarbonFile blocks : deleteDeltaFiles) {
Review comment:
if (delteDeltaFiles.size < threshold) continue
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1039,22 +1039,10 @@ private static boolean isSegmentValid(LoadMetadataDetails seg) {
if (CompactionType.IUD_DELETE_DELTA == compactionTypeIUD) {
int numberDeleteDeltaFilesThreshold =
CarbonProperties.getInstance().getNoDeleteDeltaFilesThresholdForIUDCompaction();
- List<Segment> deleteSegments = new ArrayList<>();
for (Segment seg : segments) {
- if (checkDeleteDeltaFilesInSeg(seg, segmentUpdateStatusManager,
Review comment:
remove checkDeleteDeltaFilesInSeg function
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) {
return null;
}
+ public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) {
Review comment:
remove getDeleteDeltaFilesList function
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
return blockLists;
}
+ private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg,
+ SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+
+ List<String> blockLists = new ArrayList<>();
+
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg);
+
+ List<String> blockNameList =
+ segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
+
+ Set<String> uniqueBlocks = new HashSet<String>();
+ for (final String blockName : blockNameList) {
+
+ List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName);
+
+ if (null != deleteDeltaFiles) {
+ for (CarbonFile blocks : deleteDeltaFiles) {
Review comment:
blocks -> block
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) {
return null;
}
+ public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) {
+ String segmentPath = CarbonTablePath.getSegmentPath(
+ identifier.getTablePath(), seg.getSegmentNo());
+ CarbonFile segDir = FileFactory.getCarbonFile(segmentPath);
+ CarbonFile[] allDeleteDeltaFilesOfSegment = segDir.listFiles(new CarbonFileFilter() {
+ @Override
+ public boolean accept(CarbonFile pathName) {
+ String fileName = pathName.getName();
+ return (pathName.getSize() > 0) &&
Review comment:
getSize() will trigger one S3 IO.
remove getsSize()
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) {
return null;
}
+ public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) {
+ String segmentPath = CarbonTablePath.getSegmentPath(
+ identifier.getTablePath(), seg.getSegmentNo());
Review comment:
if SegmentUpdateDetails donot contains seg, we shall return empty result directly.
which can save a lot of IO overhead
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
return blockLists;
}
+ private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg,
+ SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+
+ List<String> blockLists = new ArrayList<>();
+
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg);
+
+ List<String> blockNameList =
+ segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
+
+ Set<String> uniqueBlocks = new HashSet<String>();
+ for (final String blockName : blockNameList) {
+
+ List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName);
+
+ if (null != deleteDeltaFiles) {
+ for (CarbonFile blocks : deleteDeltaFiles) {
+ String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
+ String timestamp =
+ CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
+ String taskAndTimeStamp = task + "-" + timestamp;
+ uniqueBlocks.add(taskAndTimeStamp);
Review comment:
if (uniqueBlocks.size() > numberDeltaFilesThreshold) {
blockLists.add(seg.getSegmentNo() + "/" + blockName);
break;
}
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-714490521
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2873/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709785844
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4476/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-715196563
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4660/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717054658
> We tested with 10 segments total 10G and update more than 20 times to see the cost, update row count of each time is 176973.
> The result and comparison before and after improving is shown below, we can see that time-consuming of UPDATE reduces by about 30% after improving horizontal compaction.
>
> Time Before Improving Time after Improving
> Average time
> from 2nd to 21st UPDATE **99.19** **67.55**
> UPDATE
> 1st 38.397 71.91
> 2nd 74.918 67.386
> 3rd 43.851 46.171
> 4th 52.133 82.974
> 5th 86.021 44.499
> 6th 62.992 39.973
> 7th 99.077 69.292
> 8th 77.815 60.416
> 9th 74.949 50.187
> 10th 85.874 50.973
> 11th 103.902 58.005
> 12th 77.534 86.087
> 13th 79.154 76.283
> 14th 116.117 72.029
> 15th 112.846 74.182
> 16th 124.707 69.282
> 17th 124.677 67.546
> 18th 147.15 66.652
> 19th 135.127 109.385
> 20th 133.94 80.849
> 21st 171.112 78.92
Cool
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-711484706
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4502/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506203367
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) {
return null;
}
+ public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) {
+ String segmentPath = CarbonTablePath.getSegmentPath(
+ identifier.getTablePath(), seg.getSegmentNo());
Review comment:
Done.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709712170
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2721/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r508933445
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -125,6 +125,9 @@ object HorizontalCompaction {
segLists: util.List[Segment]): Unit = {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+
+ LOG.info(s"Horizontal Update Compaction operation is getting valid segments for [$db.$table].")
Review comment:
This log is modified, prints the validSegList and time taken to get it
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510001358
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
- // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
Review comment:
please do not delete the existing comments in code, add it back
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020858
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
return blockLists;
}
+ private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg,
+ SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+
+ List<String> blockLists = new ArrayList<>();
+
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg);
+
+ List<String> blockNameList =
+ segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
+
+ Set<String> uniqueBlocks = new HashSet<String>();
+ for (final String blockName : blockNameList) {
+
+ List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName);
+
+ if (null != deleteDeltaFiles) {
+ for (CarbonFile blocks : deleteDeltaFiles) {
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] Zhangshunyu commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717027548
LGTM
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717880662
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2959/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-711817892
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2753/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-714352138
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2864/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r508933755
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -173,6 +176,9 @@ object HorizontalCompaction {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+
+ LOG.info(s"Horizontal Delete Compaction operation is getting valid segments for [$db.$table].")
Review comment:
I have modified the log, which only prints the validSegList and is in LOG.debug
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] asfgit closed pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509948458
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,39 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-714213749
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020781
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
return blockLists;
}
+ private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg,
+ SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+
+ List<String> blockLists = new ArrayList<>();
+
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg);
+
+ List<String> blockNameList =
+ segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
+
+ Set<String> uniqueBlocks = new HashSet<String>();
+ for (final String blockName : blockNameList) {
+
+ List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName);
+
+ if (null != deleteDeltaFiles) {
+ for (CarbonFile blocks : deleteDeltaFiles) {
Review comment:
Done.
Added judgement:
if (deleteDeltaFiles.size() <= numberDeltaFilesThreshold) continue
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] Zhangshunyu commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709698583
Have you ever tested this optimization? Could you pls give a comparison result for this change?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 edited a comment on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 edited a comment on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-713522116
> Have you ever tested this optimization? Could you pls give a comparison result for this change?
as @Zhangshunyu said, its better if you can get the reading and update in PR, it would be great. Thanks
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r507380824
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,8 +1209,22 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
- segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
+ // only when SegmentUpdateDetails contain the specified block
+ // will the method getDeleteDeltaFilesList be executed
+ List<String> blockNameList = segmentUpdateStatusManager.getBlockNameFromSegment(seg);
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = new HashMap<>();
+ CarbonFile[] deleteDeltaFiles = null;
+ if (blockNameList.contains(blockName)) {
+ blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg));
+ }
+ if (blockAndDeleteDeltaFilesMap.containsKey(blockName)) {
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510063450
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,39 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
- final long deltaStartTimestamp =
- getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
- final long deltaEndTimeStamp =
- getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+ Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+ String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN;
+ if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
+ deltaFileTimestamps.forEach(timestamp -> deleteDeltaFileList.add(
+ deleteDeltaFilePrefix + timestamp + CarbonCommonConstants.DELETE_DELTA_FILE_EXT));
+ } else {
+ final long deltaEndTimeStamp =
Review comment:
ok, I understood. Added a comment here.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-713522116
> Have you ever tested this optimization? Could you pls give a comparison result for this change?
as @Zhangshunyu said, its better if you can get the reading and update in PR, it would be great. Thanks
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] marchpure commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
marchpure commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709459684
checkstyle failes. you can have a checkstyple test in your local env by run:
mvn clean install -DskipTests
metions: don't use mvn clean package -DskipTests
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709457343
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2712/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-711479262
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2747/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506202461
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) {
return null;
}
+ public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) {
Review comment:
Done.
Keep function name getDeleteDeltaFilesList and change the return type to Map<String, List<CarbonFile>>
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709323928
Can one of the admins verify this patch?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-713265280
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2803/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509190649
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -125,12 +125,18 @@ object HorizontalCompaction {
segLists: util.List[Segment]): Unit = {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+ val startTime = System.nanoTime()
+
// get the valid segments qualified for update compaction.
val validSegList = CarbonDataMergerUtil.getSegListIUDCompactionQualified(segLists,
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ val endTime = System.nanoTime()
+ LOG.info(s"time taken to get segment list for Horizontal Update Compaction is" +
Review comment:
I dont think time in this log will help us, just segment list can help in someway, not time
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r507380663
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,8 +1209,22 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
- segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
+ // only when SegmentUpdateDetails contain the specified block
+ // will the method getDeleteDeltaFilesList be executed
+ List<String> blockNameList = segmentUpdateStatusManager.getBlockNameFromSegment(seg);
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = new HashMap<>();
+ CarbonFile[] deleteDeltaFiles = null;
+ if (blockNameList.contains(blockName)) {
Review comment:
Done. Combined the two judgement
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-712840146
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4539/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] ajantha-bhat commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709454528
Add to whitelist
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020565
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
return blockLists;
}
+ private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg,
+ SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+
+ List<String> blockLists = new ArrayList<>();
+
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg);
+
+ List<String> blockNameList =
+ segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
+
+ Set<String> uniqueBlocks = new HashSet<String>();
+ for (final String blockName : blockNameList) {
+
+ List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName);
+
+ if (null != deleteDeltaFiles) {
+ for (CarbonFile blocks : deleteDeltaFiles) {
+ String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
+ String timestamp =
+ CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
+ String taskAndTimeStamp = task + "-" + timestamp;
+ uniqueBlocks.add(taskAndTimeStamp);
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509190746
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -173,11 +179,17 @@ object HorizontalCompaction {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+ val startTime = System.nanoTime()
+
val deletedBlocksList = CarbonDataMergerUtil.getSegListIUDCompactionQualified(segLists,
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ val endTime = System.nanoTime()
+ LOG.info(s"time taken to get deleted block list for Horizontal Delete Compaction is" +
Review comment:
same as above
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-714364595
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4619/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-715336355
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510677675
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
- // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
- // blocks as one. Currently DeleteDeltaFiles array contains Delete Delta Block name which
- // lies within Delete Delta Start TimeStamp and End TimeStamp. In order to eliminate
- // Spill Over Blocks will choose files with unique taskID.
- for (CarbonFile blocks : deleteDeltaFiles) {
- // Get Task ID and the Timestamp from the Block name for e.g.
- // part-0-3-1481084721319.carbondata => "3-1481084721319"
- String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
- String timestamp =
- CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
- String taskAndTimeStamp = task + "-" + timestamp;
+ if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
+ for (String file : deleteDeltaFiles) {
Review comment:
formatted and modified the variable name
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r507380251
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,8 +1209,22 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
- segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
+ // only when SegmentUpdateDetails contain the specified block
+ // will the method getDeleteDeltaFilesList be executed
+ List<String> blockNameList = segmentUpdateStatusManager.getBlockNameFromSegment(seg);
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = new HashMap<>();
+ CarbonFile[] deleteDeltaFiles = null;
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509947651
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1039,22 +1039,24 @@ private static boolean isSegmentValid(LoadMetadataDetails seg) {
if (CompactionType.IUD_DELETE_DELTA == compactionTypeIUD) {
int numberDeleteDeltaFilesThreshold =
CarbonProperties.getInstance().getNoDeleteDeltaFilesThresholdForIUDCompaction();
- List<Segment> deleteSegments = new ArrayList<>();
- for (Segment seg : segments) {
- if (checkDeleteDeltaFilesInSeg(seg, segmentUpdateStatusManager,
- numberDeleteDeltaFilesThreshold)) {
- deleteSegments.add(seg);
+
+ // firstly find the valid segments which are updated from SegmentUpdateDetails,
+ // in order to reduce the segment list for behind traversal
+ List<String> segmentsPresentInSegmentUpdateDetails = new ArrayList<>();
Review comment:
combined the `getDeleteDeltaFilesInSeg`and `checkDeleteDeltaFilesInSeg` , and kept the method name `checkDeleteDeltaFilesInSeg` and let it return `List<String>`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509920810
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,39 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
- final long deltaStartTimestamp =
- getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
- final long deltaEndTimeStamp =
- getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+ Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+ String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN;
+ if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
+ deltaFileTimestamps.forEach(timestamp -> deleteDeltaFileList.add(
+ deleteDeltaFilePrefix + timestamp + CarbonCommonConstants.DELETE_DELTA_FILE_EXT));
+ } else {
+ final long deltaEndTimeStamp =
Review comment:
add a comment here saying, when the deltaFileTimestamps is null, then there is only one delta file and update details will have same start and end timestamps
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513248146
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
- final long deltaStartTimestamp =
- getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
- final long deltaEndTimeStamp =
- getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+ Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+ String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN;
+ if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
Review comment:
Added this test case "test IUD Horizontal Compaction after updating without compaction"
in HorizontalCompactionTestCase
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709459321
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4466/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r508933445
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -125,6 +125,9 @@ object HorizontalCompaction {
segLists: util.List[Segment]): Unit = {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+
+ LOG.info(s"Horizontal Update Compaction operation is getting valid segments for [$db.$table].")
Review comment:
I have modified the log, which only prints the validSegList and is in LOG.debug
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020950
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -455,6 +455,51 @@ public boolean accept(CarbonFile pathName) {
return null;
}
+ public Map<String, List<CarbonFile>> getDeleteDeltaFilesForSegment(final Segment seg) {
+ String segmentPath = CarbonTablePath.getSegmentPath(
+ identifier.getTablePath(), seg.getSegmentNo());
+ CarbonFile segDir = FileFactory.getCarbonFile(segmentPath);
+ CarbonFile[] allDeleteDeltaFilesOfSegment = segDir.listFiles(new CarbonFileFilter() {
+ @Override
+ public boolean accept(CarbonFile pathName) {
+ String fileName = pathName.getName();
+ return (pathName.getSize() > 0) &&
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510002204
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
- // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
- // blocks as one. Currently DeleteDeltaFiles array contains Delete Delta Block name which
- // lies within Delete Delta Start TimeStamp and End TimeStamp. In order to eliminate
- // Spill Over Blocks will choose files with unique taskID.
- for (CarbonFile blocks : deleteDeltaFiles) {
- // Get Task ID and the Timestamp from the Block name for e.g.
- // part-0-3-1481084721319.carbondata => "3-1481084721319"
- String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
- String timestamp =
- CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
- String taskAndTimeStamp = task + "-" + timestamp;
+ if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
+ for (String file : deleteDeltaFiles) {
Review comment:
please do the proper formatting
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] marchpure commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
marchpure commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506261693
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,8 +1209,22 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
- segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
+ // only when SegmentUpdateDetails contain the specified block
+ // will the method getDeleteDeltaFilesList be executed
+ List<String> blockNameList = segmentUpdateStatusManager.getBlockNameFromSegment(seg);
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = new HashMap<>();
+ CarbonFile[] deleteDeltaFiles = null;
+ if (blockNameList.contains(blockName)) {
+ blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg));
+ }
+ if (blockAndDeleteDeltaFilesMap.containsKey(blockName)) {
+ List<CarbonFile> deleteDeltaFileList = blockAndDeleteDeltaFilesMap.get(blockName);
+ deleteDeltaFiles = deleteDeltaFileList.toArray(new CarbonFile[deleteDeltaFileList.size()]);
+ }
+
+ // CarbonFile[] deleteDeltaFiles =
Review comment:
delete these 2 lines
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,8 +1209,22 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
- segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
+ // only when SegmentUpdateDetails contain the specified block
+ // will the method getDeleteDeltaFilesList be executed
+ List<String> blockNameList = segmentUpdateStatusManager.getBlockNameFromSegment(seg);
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = new HashMap<>();
+ CarbonFile[] deleteDeltaFiles = null;
Review comment:
= new CarbonFile[0]
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,8 +1209,22 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
- segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
+ // only when SegmentUpdateDetails contain the specified block
+ // will the method getDeleteDeltaFilesList be executed
+ List<String> blockNameList = segmentUpdateStatusManager.getBlockNameFromSegment(seg);
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = new HashMap<>();
+ CarbonFile[] deleteDeltaFiles = null;
+ if (blockNameList.contains(blockName)) {
Review comment:
if (blockNameList.contains(blockName)) {
blockAndDeleteDeltaFilesMap =
if (blockAndDeleteDeltaFilesMap.containsKey(blockName)) {
}
}
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,8 +1209,22 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
- segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
+ // only when SegmentUpdateDetails contain the specified block
+ // will the method getDeleteDeltaFilesList be executed
+ List<String> blockNameList = segmentUpdateStatusManager.getBlockNameFromSegment(seg);
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap = new HashMap<>();
+ CarbonFile[] deleteDeltaFiles = null;
+ if (blockNameList.contains(blockName)) {
+ blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg));
+ }
+ if (blockAndDeleteDeltaFilesMap.containsKey(blockName)) {
Review comment:
no need to covert to array
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r508933755
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -173,6 +176,9 @@ object HorizontalCompaction {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+
+ LOG.info(s"Horizontal Delete Compaction operation is getting valid segments for [$db.$table].")
Review comment:
This log is modified, prints the deletedBlocksList and time taken to get it
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510018289
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,20 +1197,16 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
+ List<String> deleteFilePathList =
segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
String destFileName =
blockName + "-" + timestamp.toString() + CarbonCommonConstants.DELETE_DELTA_FILE_EXT;
- List<String> deleteFilePathList = new ArrayList<>();
- if (null != deleteDeltaFiles && deleteDeltaFiles.length > 0 && null != deleteDeltaFiles[0]
- .getParentFile()) {
- String fullBlockFilePath = deleteDeltaFiles[0].getParentFile().getCanonicalPath()
- + CarbonCommonConstants.FILE_SEPARATOR + destFileName;
-
- for (CarbonFile cFile : deleteDeltaFiles) {
- deleteFilePathList.add(cFile.getCanonicalPath());
- }
+ if (deleteFilePathList.size() > 0) {
+ String deleteDeltaFilePath = deleteFilePathList.get(0);
+ String fullBlockFilePath =
+ deleteDeltaFilePath.substring(0, deleteDeltaFilePath.lastIndexOf("/")) +
Review comment:
use constant for "/"
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-711484392
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2748/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510065323
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1246,20 +1197,16 @@ public static boolean isHorizontalCompactionEnabled() {
// set the update status.
segmentUpdateStatusManager.setUpdateStatusDetails(segmentUpdateDetails);
- CarbonFile[] deleteDeltaFiles =
+ List<String> deleteFilePathList =
segmentUpdateStatusManager.getDeleteDeltaFilesList(new Segment(seg), blockName);
String destFileName =
blockName + "-" + timestamp.toString() + CarbonCommonConstants.DELETE_DELTA_FILE_EXT;
- List<String> deleteFilePathList = new ArrayList<>();
- if (null != deleteDeltaFiles && deleteDeltaFiles.length > 0 && null != deleteDeltaFiles[0]
- .getParentFile()) {
- String fullBlockFilePath = deleteDeltaFiles[0].getParentFile().getCanonicalPath()
- + CarbonCommonConstants.FILE_SEPARATOR + destFileName;
-
- for (CarbonFile cFile : deleteDeltaFiles) {
- deleteFilePathList.add(cFile.getCanonicalPath());
- }
+ if (deleteFilePathList.size() > 0) {
+ String deleteDeltaFilePath = deleteFilePathList.get(0);
+ String fullBlockFilePath =
+ deleteDeltaFilePath.substring(0, deleteDeltaFilePath.lastIndexOf("/")) +
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709711202
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4475/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-710848203
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2741/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510064286
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
- // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
Review comment:
Recovered the existing comments. I will pay attention to it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509947651
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1039,22 +1039,24 @@ private static boolean isSegmentValid(LoadMetadataDetails seg) {
if (CompactionType.IUD_DELETE_DELTA == compactionTypeIUD) {
int numberDeleteDeltaFilesThreshold =
CarbonProperties.getInstance().getNoDeleteDeltaFilesThresholdForIUDCompaction();
- List<Segment> deleteSegments = new ArrayList<>();
- for (Segment seg : segments) {
- if (checkDeleteDeltaFilesInSeg(seg, segmentUpdateStatusManager,
- numberDeleteDeltaFilesThreshold)) {
- deleteSegments.add(seg);
+
+ // firstly find the valid segments which are updated from SegmentUpdateDetails,
+ // in order to reduce the segment list for behind traversal
+ List<String> segmentsPresentInSegmentUpdateDetails = new ArrayList<>();
Review comment:
combined the `getDeleteDeltaFilesInSeg`and `checkDeleteDeltaFilesInSeg` , and kept the method name `checkDeleteDeltaFilesInSeg` and let it return List<String>
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509920920
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,39 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
Review comment:
remove empty lines
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509204221
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,38 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
final long deltaStartTimestamp =
getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
final long deltaEndTimeStamp =
getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ Set<String> deleteDeltaFiles = new HashSet<>();
Review comment:
instead of this you can follow like below
1. from `SegmentUpdateDetails`, call `getDeltaFileStamps` method and check if not null and size > 0, if not null, directly take the list of delta timestamps and prepare to final list with the blockname, as we already know
2. If the `getDeltaFileStamps` gives null, then there is only one valid delta timestamp, in this case, `SegmentUpdateDetails` will have same delta start and end timestamp, so you can take one and form a delta file name and return only this, as only one file is there. No need to create list just to add one, so u can create list only in 1st case.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709707004
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4474/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509956278
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,39 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
- final long deltaStartTimestamp =
- getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
- final long deltaEndTimeStamp =
- getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+ Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+ String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN;
+ if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
+ deltaFileTimestamps.forEach(timestamp -> deleteDeltaFileList.add(
+ deleteDeltaFilePrefix + timestamp + CarbonCommonConstants.DELETE_DELTA_FILE_EXT));
+ } else {
+ final long deltaEndTimeStamp =
Review comment:
The part in else{} is when deltaFileTimestamps is null.
The updateDetails has same start and end timestamp, here I use endTimestamp to form the delta file
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709976082
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2730/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-714520638
Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4629/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510065227
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
- // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
- // blocks as one. Currently DeleteDeltaFiles array contains Delete Delta Block name which
- // lies within Delete Delta Start TimeStamp and End TimeStamp. In order to eliminate
- // Spill Over Blocks will choose files with unique taskID.
- for (CarbonFile blocks : deleteDeltaFiles) {
- // Get Task ID and the Timestamp from the Block name for e.g.
- // part-0-3-1481084721319.carbondata => "3-1481084721319"
- String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
- String timestamp =
- CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
- String taskAndTimeStamp = task + "-" + timestamp;
+ if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
+ for (String file : deleteDeltaFiles) {
Review comment:
It was formatted
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] QiangCai commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
QiangCai commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513147277
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
- final long deltaStartTimestamp =
- getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
- final long deltaEndTimeStamp =
- getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+ Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+ String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN;
+ if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
Review comment:
please add test case:
1. set CarbonCommonConstants.CARBON_HORIZONTAL_COMPACTION_ENABLE to false
2. update different row in same file three times (it should generate three delete delta files for same blockname)
3. set CarbonCommonConstants.CARBON_HORIZONTAL_COMPACTION_ENABLE to true
4. update different row in same file again
5. query table to check result
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
- final long deltaStartTimestamp =
- getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
- final long deltaEndTimeStamp =
- getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+ Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+ String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN;
+ if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
+ deltaFileTimestamps.forEach(timestamp -> deleteDeltaFileList.add(
+ deleteDeltaFilePrefix + timestamp + CarbonCommonConstants.DELETE_DELTA_FILE_EXT));
Review comment:
better to reuse CarbonUpdateUtil.getDeleteDeltaFilePath
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -177,6 +178,7 @@ object HorizontalCompaction {
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ LOG.debug(s"The segment list for Horizontal Update Compaction is $deletedBlocksList")
Review comment:
same above
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
Review comment:
change seg to segment
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -130,6 +130,7 @@ object HorizontalCompaction {
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ LOG.debug(s"The segment list for Horizontal Update Compaction is $validSegList")
Review comment:
avoid converting the list to string
if (LOG.isDebugEnabled) {
...
}
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,43 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
+ if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
Review comment:
add test case for case when CarbonCommonConstants.DELETE_DELTAFILE_COUNT_THRESHOLD_IUD_COMPACTION >= 3
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] ydvpankaj99 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
ydvpankaj99 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-713228255
retest this please
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020781
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1210,6 +1198,39 @@ private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
return blockLists;
}
+ private static List<String> checkAndGetDeleteDeltaFilesInSeg(Segment seg,
+ SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+
+ List<String> blockLists = new ArrayList<>();
+
+ Map<String, List<CarbonFile>> blockAndDeleteDeltaFilesMap =
+ segmentUpdateStatusManager.getDeleteDeltaFilesForSegment(seg);
+
+ List<String> blockNameList =
+ segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
+
+ Set<String> uniqueBlocks = new HashSet<String>();
+ for (final String blockName : blockNameList) {
+
+ List<CarbonFile> deleteDeltaFiles = blockAndDeleteDeltaFilesMap.get(blockName);
+
+ if (null != deleteDeltaFiles) {
+ for (CarbonFile blocks : deleteDeltaFiles) {
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709805701
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2722/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513249713
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,43 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
+ if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
Review comment:
Added test case "test IUD Horizontal Compaction when CarbonCommonConstants.DELETE_DELTAFILE_COUNT_THRESHOLD_IUD_COMPACTION >= 3"
in HorizontalCompactionTestCase
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509979593
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,39 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
- final long deltaStartTimestamp =
- getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
- final long deltaEndTimeStamp =
- getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+ Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+ String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN;
+ if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
+ deltaFileTimestamps.forEach(timestamp -> deleteDeltaFileList.add(
+ deleteDeltaFilePrefix + timestamp + CarbonCommonConstants.DELETE_DELTA_FILE_EXT));
+ } else {
+ final long deltaEndTimeStamp =
Review comment:
yes, i mean to say, add a comment in code so that whoever works on it next, it will be easy to understand from comments. Always add comments wherever you feel important
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] marchpure commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
marchpure commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709674242
add some logs and comments.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-711838443
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4507/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509945109
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,38 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
final long deltaStartTimestamp =
getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
final long deltaEndTimeStamp =
getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ Set<String> deleteDeltaFiles = new HashSet<>();
Review comment:
modified code like this approach
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r509941524
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -125,12 +125,18 @@ object HorizontalCompaction {
segLists: util.List[Segment]): Unit = {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+ val startTime = System.nanoTime()
+
// get the valid segments qualified for update compaction.
val validSegList = CarbonDataMergerUtil.getSegListIUDCompactionQualified(segLists,
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ val endTime = System.nanoTime()
+ LOG.info(s"time taken to get segment list for Horizontal Update Compaction is" +
Review comment:
I have modified the log, which only prints the validSegList and is in LOG.debug
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -173,11 +179,17 @@ object HorizontalCompaction {
val db = carbonTable.getDatabaseName
val table = carbonTable.getTableName
+ val startTime = System.nanoTime()
+
val deletedBlocksList = CarbonDataMergerUtil.getSegListIUDCompactionQualified(segLists,
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ val endTime = System.nanoTime()
+ LOG.info(s"time taken to get deleted block list for Horizontal Delete Compaction is" +
Review comment:
I have modified the log, which only prints the validSegList and is in LOG.debug
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] QiangCai commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
QiangCai commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-718729381
LGTM, greate job
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-709707683
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2720/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513249194
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -130,6 +130,7 @@ object HorizontalCompaction {
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ LOG.debug(s"The segment list for Horizontal Update Compaction is $validSegList")
Review comment:
Done
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -177,6 +178,7 @@ object HorizontalCompaction {
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ LOG.debug(s"The segment list for Horizontal Update Compaction is $deletedBlocksList")
Review comment:
Done
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] marchpure commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
marchpure commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-710780197
retest this please
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] marchpure commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
marchpure commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-711459338
retest this please
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] akashrn5 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
akashrn5 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510123816
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1138,73 +1126,36 @@ private static Boolean checkUpdateDeltaFilesInSeg(Segment seg,
}
/**
- * Check is the segment passed qualifies for IUD delete delta compaction or not i.e.
- * if the number of delete delta files present in the segment is more than
- * numberDeltaFilesThreshold.
+ * Check whether the segment passed qualifies for IUD delete delta compaction or not,
+ * i.e., if the number of delete delta files present in the segment is more than
+ * numberDeltaFilesThreshold, this segment will be selected.
*
- * @param seg
- * @param segmentUpdateStatusManager
- * @param numberDeltaFilesThreshold
- * @return
+ * @param seg segment to be qualified
+ * @param segmentUpdateStatusManager segments & blocks details management
+ * @param numberDeltaFilesThreshold threshold of delete delta files
+ * @return block list of the segment
*/
- private static boolean checkDeleteDeltaFilesInSeg(Segment seg,
+ private static List<String> checkDeleteDeltaFilesInSeg(Segment seg,
SegmentUpdateStatusManager segmentUpdateStatusManager, int numberDeltaFilesThreshold) {
+ List<String> blockLists = new ArrayList<>();
Set<String> uniqueBlocks = new HashSet<String>();
List<String> blockNameList =
segmentUpdateStatusManager.getBlockNameFromSegment(seg.getSegmentNo());
-
- for (final String blockName : blockNameList) {
-
- CarbonFile[] deleteDeltaFiles =
+ for (String blockName : blockNameList) {
+ List<String> deleteDeltaFiles =
segmentUpdateStatusManager.getDeleteDeltaFilesList(seg, blockName);
- if (null != deleteDeltaFiles) {
- // The Delete Delta files may have Spill over blocks. Will consider multiple spill over
- // blocks as one. Currently DeleteDeltaFiles array contains Delete Delta Block name which
- // lies within Delete Delta Start TimeStamp and End TimeStamp. In order to eliminate
- // Spill Over Blocks will choose files with unique taskID.
- for (CarbonFile blocks : deleteDeltaFiles) {
- // Get Task ID and the Timestamp from the Block name for e.g.
- // part-0-3-1481084721319.carbondata => "3-1481084721319"
- String task = CarbonTablePath.DataFileUtil.getTaskNo(blocks.getName());
- String timestamp =
- CarbonTablePath.DataFileUtil.getTimeStampFromDeleteDeltaFile(blocks.getName());
- String taskAndTimeStamp = task + "-" + timestamp;
+ if (null != deleteDeltaFiles && deleteDeltaFiles.size() > numberDeltaFilesThreshold) {
+ for (String file : deleteDeltaFiles) {
Review comment:
i gave comment to change the variable name and remove spaces after and before :, please check
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-711479603
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4501/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020436
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1039,22 +1039,10 @@ private static boolean isSegmentValid(LoadMetadataDetails seg) {
if (CompactionType.IUD_DELETE_DELTA == compactionTypeIUD) {
int numberDeleteDeltaFilesThreshold =
CarbonProperties.getInstance().getNoDeleteDeltaFilesThresholdForIUDCompaction();
- List<Segment> deleteSegments = new ArrayList<>();
for (Segment seg : segments) {
- if (checkDeleteDeltaFilesInSeg(seg, segmentUpdateStatusManager,
Review comment:
Done.
Combined function checkDeleteDeltaFilesInSeg and function getDeleteDeltaFilesInSeg to new function checkAndGetDeleteDeltaFilesInSeg
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r510063577
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -130,6 +130,7 @@ object HorizontalCompaction {
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ LOG.debug(s"The segment list for Horizontal Update Compaction is ${ validSegList }")
Review comment:
Done
##########
File path: integration/spark/src/main/scala/org/apache/spark/sql/execution/command/mutation/HorizontalCompaction.scala
##########
@@ -177,6 +178,7 @@ object HorizontalCompaction {
absTableIdentifier,
segmentUpdateStatusManager,
compactionTypeIUD)
+ LOG.debug(s"The segment list for Horizontal Update Compaction is ${ deletedBlocksList }")
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r506020436
##########
File path: processing/src/main/java/org/apache/carbondata/processing/merger/CarbonDataMergerUtil.java
##########
@@ -1039,22 +1039,10 @@ private static boolean isSegmentValid(LoadMetadataDetails seg) {
if (CompactionType.IUD_DELETE_DELTA == compactionTypeIUD) {
int numberDeleteDeltaFilesThreshold =
CarbonProperties.getInstance().getNoDeleteDeltaFilesThresholdForIUDCompaction();
- List<Segment> deleteSegments = new ArrayList<>();
for (Segment seg : segments) {
- if (checkDeleteDeltaFilesInSeg(seg, segmentUpdateStatusManager,
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] Zhangshunyu commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
Zhangshunyu commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-717025630
> We tested with 10 segments total 10G and update more than 20 times to see the cost, update row count of each time is 176973.
> The result and comparison before and after improving is shown below, we can see that time-consuming of UPDATE reduces by about 30% after improving horizontal compaction.
>
> Time Before Improving Time after Improving
> Average time
> from 2nd to 21st UPDATE **99.19** **67.55**
> UPDATE
> 1st 38.397 71.91
> 2nd 74.918 67.386
> 3rd 43.851 46.171
> 4th 52.133 82.974
> 5th 86.021 44.499
> 6th 62.992 39.973
> 7th 99.077 69.292
> 8th 77.815 60.416
> 9th 74.949 50.187
> 10th 85.874 50.973
> 11th 103.902 58.005
> 12th 77.534 86.087
> 13th 79.154 76.283
> 14th 116.117 72.029
> 15th 112.846 74.182
> 16th 124.707 69.282
> 17th 124.677 67.546
> 18th 147.15 66.652
> 19th 135.127 109.385
> 20th 133.94 80.849
> 21st 171.112 78.92
greate!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-715197754
Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2905/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
CarbonDataQA1 commented on pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#issuecomment-710849261
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4495/
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [carbondata] shenjiayu17 commented on a change in pull request #3986: [CARBONDATA-4034] Improve the time-consuming of Horizontal Compaction for update
Posted by GitBox <gi...@apache.org>.
shenjiayu17 commented on a change in pull request #3986:
URL: https://github.com/apache/carbondata/pull/3986#discussion_r513248715
##########
File path: core/src/main/java/org/apache/carbondata/core/statusmanager/SegmentUpdateStatusManager.java
##########
@@ -415,44 +415,41 @@ public boolean accept(CarbonFile pathName) {
}
/**
- * Return all delta file for a block.
- * @param segmentId
- * @param blockName
- * @return
+ * Get all delete delta files of the block of specified segment.
+ * Actually, delete delta file name is generated from each SegmentUpdateDetails.
+ *
+ * @param seg the segment which is to find block and its delete delta files
+ * @param blockName the specified block of the segment
+ * @return delete delta file list of the block
*/
- public CarbonFile[] getDeleteDeltaFilesList(final Segment segmentId, final String blockName) {
+ public List<String> getDeleteDeltaFilesList(final Segment seg, final String blockName) {
+ List<String> deleteDeltaFileList = new ArrayList<>();
String segmentPath = CarbonTablePath.getSegmentPath(
- identifier.getTablePath(), segmentId.getSegmentNo());
- CarbonFile segDir =
- FileFactory.getCarbonFile(segmentPath);
+ identifier.getTablePath(), seg.getSegmentNo());
+
for (SegmentUpdateDetails block : updateDetails) {
if ((block.getBlockName().equalsIgnoreCase(blockName)) &&
- (block.getSegmentName().equalsIgnoreCase(segmentId.getSegmentNo()))
- && !CarbonUpdateUtil.isBlockInvalid((block.getSegmentStatus()))) {
- final long deltaStartTimestamp =
- getStartTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
- final long deltaEndTimeStamp =
- getEndTimeOfDeltaFile(CarbonCommonConstants.DELETE_DELTA_FILE_EXT, block);
-
- return segDir.listFiles(new CarbonFileFilter() {
-
- @Override
- public boolean accept(CarbonFile pathName) {
- String fileName = pathName.getName();
- if (pathName.getSize() > 0
- && fileName.endsWith(CarbonCommonConstants.DELETE_DELTA_FILE_EXT)) {
- String blkName = fileName.substring(0, fileName.lastIndexOf("-"));
- long timestamp =
- Long.parseLong(CarbonTablePath.DataFileUtil.getTimeStampFromFileName(fileName));
- return blockName.equals(blkName) && timestamp <= deltaEndTimeStamp
- && timestamp >= deltaStartTimestamp;
- }
- return false;
- }
- });
+ (block.getSegmentName().equalsIgnoreCase(seg.getSegmentNo())) &&
+ !CarbonUpdateUtil.isBlockInvalid(block.getSegmentStatus())) {
+ Set<String> deltaFileTimestamps = block.getDeltaFileStamps();
+ String deleteDeltaFilePrefix = segmentPath + CarbonCommonConstants.FILE_SEPARATOR +
+ blockName + CarbonCommonConstants.HYPHEN;
+ if (deltaFileTimestamps != null && deltaFileTimestamps.size() > 0) {
+ deltaFileTimestamps.forEach(timestamp -> deleteDeltaFileList.add(
+ deleteDeltaFilePrefix + timestamp + CarbonCommonConstants.DELETE_DELTA_FILE_EXT));
Review comment:
Done
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org