You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@inlong.apache.org by GitBox <gi...@apache.org> on 2022/07/21 12:55:18 UTC

[GitHub] [inlong] ifndef-SleePy opened a new pull request, #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

ifndef-SleePy opened a new pull request, #5170:
URL: https://github.com/apache/inlong/pull/5170

   …ric collection
   
   ### Prepare a Pull Request
   
   - Fixes #5169 
   
   ### Motivation
   
   * Fix race condition issue of HBaseSinkFunction metric collection
   
   ### Modifications
   
   *Describe the modifications you've done.*
   
   ### Verifying this change
   
   *(Please pick either of the following options)*
   
   - [x] This change is a trivial rework/code cleanup without any test coverage.
   
   ### Documentation
   
     - Does this pull request introduce a new feature? (no)
     - If yes, how is the feature documented? (not applicable)
     - If a feature is not applicable for documentation, explain why?
     - If a feature is not documented yet in this PR, please create a follow-up issue for adding the documentation
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] ifndef-SleePy commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
ifndef-SleePy commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1195369139

   > > Hi guys, thanks for reviewing. First, I'm sorry that I didn't explain my points behind the PR.
   > > As you can see, all the comments are focused on the scenario of flushing failure. I think we don't need care about the metrics after flushing failure. Because
   > > 
   > > 1. If flushing fails, the job would falls into failover scenario.
   > > 2. Metric accuracy can not be guaranteed when failover happens. The consistency mode of `HBaseSinkFunction` is at-least-once. So some of produced records would probably be re-produced again. However the metric collection would not be reverted. All of these records would be counted.
   > > 3. Moreover, nobody knows how many records have been written into HBase Region Server when flushing fails. Flushing of `BufferedMutator` is not a transactional operation. If I remember correctly, HBase client spawns RPC calls to several Region Server concurrently. Any failure RPC call leads to a global failure. So when failure happens, we don't know how many RPC succeeded. It might be none, some or all records have been flushed successfully.
   > > 
   > > My points here are:
   > > 
   > > 1. We have to guarantee the metric accuracy if no failure happens. That's what we do in this PR.
   > > 2. Metric is not accurate if flushing fails, no matter what we do. So we don't need to involve any synchronization of these counters. Just make it as cheaper as possible.
   > > 3. A counter records the number of flushing failure times is meaningful. But I'm a bit confused of the counter name "dirty", so I removed the these ambiguous counters. Maybe we could introduce another counter to record the failure times.
   > > 
   > > What do you guys think of this? @EMsnap @gong @yunqingmoswu
   > 
   > As far as I know, the underlying logic of refresh should generate hfile first and then load, then refresh will only have this batch of data either all visible or all lost. If I understand correctly, then after refreshing the statistics, the data obtained will be more accurate, not only for dirty data, but also for normal sync data.
   
   1. The `HFile` way you described is called "bulk loading" [1]. That's not the way worked here. `BufferedMutator` would not trigger bulk loading, there is another API for bulk loading. It would go through WAL and `MemStore` to be ingested into Region Server.
   2. Bulk loading also could not guarantee "this batch of data either all visible or all lost", because there might be several `HFile` involved in the batch operation.
   3. HBase could guarantee consistency in a row, but not cross rows, please check the ACID docs [2], "APIs that mutate several rows will _not_ be atomic across the multiple rows.".
   
   [1] https://hbase.apache.org/book.html#arch.bulk.load
   [2] https://hbase.apache.org/acid-semantics.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] EMsnap merged pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
EMsnap merged PR #5170:
URL: https://github.com/apache/inlong/pull/5170


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] gong commented on a diff in pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
gong commented on code in PR #5170:
URL: https://github.com/apache/inlong/pull/5170#discussion_r927265471


##########
inlong-sort/sort-connectors/hbase/src/main/java/org/apache/inlong/sort/hbase/sink/HBaseSinkFunction.java:
##########
@@ -233,39 +213,20 @@ public void invoke(T value, Context context) throws Exception {
         checkErrorAndRethrow();
 
         mutator.mutate(mutationConverter.convertToMutation(value));
-        rowSize++;
-        dataSize = dataSize + value.toString().getBytes(StandardCharsets.UTF_8).length;
+        if (metricData.getNumRecordsOut() != null) {
+            metricData.getNumRecordsOut().inc(1);
+        }
+        if (metricData.getNumRecordsOut() != null) {
+            metricData.getNumBytesOut()
+                .inc(value.toString().getBytes(StandardCharsets.UTF_8).length);
+        }
         // flush when the buffer number of mutations greater than the configured max size.
         if (bufferFlushMaxMutations > 0
                 && numPendingRequests.incrementAndGet() >= bufferFlushMaxMutations) {
-            try {
-                flush();
-                if (metricData.getNumRecordsOut() != null) {
-                    metricData.getNumRecordsOut().inc(rowSize);
-                }
-                if (metricData.getNumRecordsOut() != null) {
-                    metricData.getNumBytesOut()
-                            .inc(dataSize);
-                }
-                resetStateAfterFlush();
-            } catch (Exception e) {
-                if (metricData.getDirtyRecords() != null) {
-                    metricData.getDirtyRecords().inc(rowSize);
-                }
-                if (metricData.getDirtyBytes() != null) {
-                    metricData.getDirtyBytes().inc(dataSize);
-                }

Review Comment:
   lost write Fail records



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] yunqingmoswu commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
yunqingmoswu commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1193031171

   > @yunqingmoswu @thexiay PTAL, thanks.
   
   done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] yunqingmoswu commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
yunqingmoswu commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1195009736

   > Hi guys, thanks for reviewing. First, I'm sorry that I didn't explain my points behind the PR.
   > 
   > As you can see, all the comments are focused on the scenario of flushing failure. I think we don't need care about the metrics after flushing failure. Because
   > 
   > 1. If flushing fails, the job would falls into failover scenario.
   > 2. Metric accuracy can not be guaranteed when failover happens. The consistency mode of `HBaseSinkFunction` is at-least-once. So some of produced records would probably be re-produced again. However the metric collection would not be reverted. All of these records would be counted.
   > 3. Moreover, nobody knows how many records have been written into HBase Region Server when flushing fails. Flushing of `BufferedMutator` is not a transactional operation. If I remember correctly, HBase client spawns RPC calls to several Region Server concurrently. Any failure RPC call leads to a global failure. So when failure happens, we don't know how many RPC succeeded. It might be none, some or all records have been flushed successfully.
   > 
   > My points here are:
   > 
   > 1. We have to guarantee the metric accuracy if no failure happens. That's what we do in this PR.
   > 2. Metric is not accurate if flushing fails, no matter what we do. So we don't need to involve any synchronization of these counters. Just make it as cheaper as possible.
   > 3. A counter records the number of flushing failure times is meaningful. But I'm a bit confused of the counter name "dirty", so I removed the these ambiguous counters. Maybe we could introduce another counter to record the failure times.
   > 
   > What do you guys think of this? @EMsnap @gong @yunqingmoswu
   
   As far as I know, the underlying logic of refresh should generate hfile first and then load, then refresh will only have this batch of data either all visible or all lost. If I understand correctly, then after refreshing the statistics, the data obtained will be more accurate, not only for dirty data, but also for normal sync data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] yunqingmoswu commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
yunqingmoswu commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1193031071

   > @yunqingmoswu @thexiay PTAL, thanks.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] ifndef-SleePy commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
ifndef-SleePy commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1194082432

   Hi guys, thanks for reviewing. First, I'm sorry that I didn't explain my points behind the PR. 
   
   As you can see, all the comments are focused on the scenario of flushing failure. I think we don't need care about the metrics after flushing failure. Because
   1. If flushing fails, the job would falls into failover scenario.
   2. Metric accuracy can not be guaranteed when failover happens. The consistency mode of `HBaseSinkFunction` is at-least-once. So some of produced records would probably be re-produced again. However the metric collection would not be reverted. All of these records would be counted.
   3. Moreover, nobody knows how many records have been written into HBase Region Server when flushing fails. Flushing of `BufferedMutator` is not a transactional operation. If I remember correctly, HBase client spawns RPC calls to several Region Server concurrently. Any failure RPC call leads to a global failure. So when failure happens, we don't know how many RPC succeeded. It might be none, some or all records have been flushed successfully.
   
   My points here are:
   1. We have to guarantee the metric accuracy if no failure happens. That's what we do in this PR.
   2. Metric is not accurate if flushing fails, no matter what we do. So we don't need to involve any synchronization of these counters. Just make it as cheaper as possible.
   3. A counter records the number of flushing failure times is meaningful. But I'm a bit confused of the counter name "dirty", so I removed the these ambiguous counters. Maybe we could introduce another counter to record the failure times.
   
   What do you guys think of this? @EMsnap @gong @yunqingmoswu 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] yunqingmoswu commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
yunqingmoswu commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1195399228

   > > > Hi guys, thanks for reviewing. First, I'm sorry that I didn't explain my points behind the PR.
   > > > As you can see, all the comments are focused on the scenario of flushing failure. I think we don't need care about the metrics after flushing failure. Because
   > > > 
   > > > 1. If flushing fails, the job would falls into failover scenario.
   > > > 2. Metric accuracy can not be guaranteed when failover happens. The consistency mode of `HBaseSinkFunction` is at-least-once. So some of produced records would probably be re-produced again. However the metric collection would not be reverted. All of these records would be counted.
   > > > 3. Moreover, nobody knows how many records have been written into HBase Region Server when flushing fails. Flushing of `BufferedMutator` is not a transactional operation. If I remember correctly, HBase client spawns RPC calls to several Region Server concurrently. Any failure RPC call leads to a global failure. So when failure happens, we don't know how many RPC succeeded. It might be none, some or all records have been flushed successfully.
   > > > 
   > > > My points here are:
   > > > 
   > > > 1. We have to guarantee the metric accuracy if no failure happens. That's what we do in this PR.
   > > > 2. Metric is not accurate if flushing fails, no matter what we do. So we don't need to involve any synchronization of these counters. Just make it as cheaper as possible.
   > > > 3. A counter records the number of flushing failure times is meaningful. But I'm a bit confused of the counter name "dirty", so I removed the these ambiguous counters. Maybe we could introduce another counter to record the failure times.
   > > > 
   > > > What do you guys think of this? @EMsnap @gong @yunqingmoswu
   > > 
   > > 
   > > As far as I know, the underlying logic of refresh should generate hfile first and then load, then refresh will only have this batch of data either all visible or all lost. If I understand correctly, then after refreshing the statistics, the data obtained will be more accurate, not only for dirty data, but also for normal sync data.
   > 
   > 1. The `HFile` way you described is called "bulk loading" [1]. That's not the way worked here. `BufferedMutator` would not trigger bulk loading, there is another API for bulk loading. It would go through WAL and `MemStore` to be ingested into Region Server.
   > 2. Bulk loading also could not guarantee "this batch of data either all visible or all lost", because there might be several `HFile` involved in the batch operation.
   > 3. HBase could guarantee consistency in a row, but not cross rows, please check the ACID docs [2], "APIs that mutate several rows will _not_ be atomic across the multiple rows.".
   > 
   > [1] https://hbase.apache.org/book.html#arch.bulk.load [2] https://hbase.apache.org/acid-semantics.html
   
   Thanks for clarifying.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] dockerzhang commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
dockerzhang commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1192140370

   @yunqingmoswu @thexiay PTAL, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] gong commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
gong commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1199164005

   please git pull uptream master , git rebase , resolve conflic@ifndef-Sleepy


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] gong commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
gong commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1192172659

   Maybe we can define a `ConcurrentCounter`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] ifndef-SleePy commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
ifndef-SleePy commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1198142926

   After an offline communication with @yunqingmoswu , we reached an agreement that just keep the origin implementation with thread-safe counter. Because it's bases on production requirement, not technical.
   1. We have to keep the meaningless "dirty" prefix counter till someday the production manager realizes that it's meaningless.
   2. Summing the counter after flushing is better than before flushing although neither of them could guarantee accuracy. Because we don't want to meet the scenario that user asks a question why there exists successful counter in metric system however there is no data in HBase.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] gong commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
gong commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1195012299

   > 
   
   @ifndef-SleePy hello, here define of dirty data is write fail, and we need data records and data size.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] EMsnap commented on a diff in pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
EMsnap commented on code in PR #5170:
URL: https://github.com/apache/inlong/pull/5170#discussion_r927260497


##########
inlong-sort/sort-connectors/hbase/src/main/java/org/apache/inlong/sort/hbase/sink/HBaseSinkFunction.java:
##########
@@ -233,39 +213,20 @@ public void invoke(T value, Context context) throws Exception {
         checkErrorAndRethrow();
 
         mutator.mutate(mutationConverter.convertToMutation(value));
-        rowSize++;
-        dataSize = dataSize + value.toString().getBytes(StandardCharsets.UTF_8).length;
+        if (metricData.getNumRecordsOut() != null) {
+            metricData.getNumRecordsOut().inc(1);

Review Comment:
   metrics should be updated after flush 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] yunqingmoswu commented on a diff in pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
yunqingmoswu commented on code in PR #5170:
URL: https://github.com/apache/inlong/pull/5170#discussion_r928060651


##########
inlong-sort/sort-connectors/hbase/src/main/java/org/apache/inlong/sort/hbase/sink/HBaseSinkFunction.java:
##########
@@ -80,13 +79,10 @@
     private transient ScheduledExecutorService executor;
     private transient ScheduledFuture scheduledFuture;
     private transient AtomicLong numPendingRequests;
-    private transient RuntimeContext runtimeContext;
 
     private transient volatile boolean closed = false;
 
     private MetricData metricData;
-    private Long dataSize = 0L;
-    private Long rowSize = 0L;

Review Comment:
   The original logic should be fine, just turn it into an atomic variable.



##########
inlong-sort/sort-connectors/hbase/src/main/java/org/apache/inlong/sort/hbase/sink/HBaseSinkFunction.java:
##########
@@ -80,13 +79,10 @@
     private transient ScheduledExecutorService executor;
     private transient ScheduledFuture scheduledFuture;
     private transient AtomicLong numPendingRequests;
-    private transient RuntimeContext runtimeContext;
 
     private transient volatile boolean closed = false;
 
     private MetricData metricData;
-    private Long dataSize = 0L;

Review Comment:
   The original logic should be fine, just turn it into an atomic variable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] ifndef-SleePy commented on pull request #5170: [INLONG-5169][Sort] Fix race condition issue of HBaseSinkFunction metric collection

Posted by GitBox <gi...@apache.org>.
ifndef-SleePy commented on PR #5170:
URL: https://github.com/apache/inlong/pull/5170#issuecomment-1203896269

   I've re-created the PR after resolving conflicts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org