Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/12/27 03:18:04 UTC

[GitHub] [dolphinscheduler] duspring opened a new issue, #13282: [Improvement][dolphinscheduler-data-quality] Data quality optimization of large amount of data

duspring opened a new issue, #13282:
URL: https://github.com/apache/dolphinscheduler/issues/13282

   ### Search before asking
   
   - [X] I have searched the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar feature requirement.
   
   
   ### Description
   
   When the data volume exceeds 30 million records, task execution is slow and the following error is reported:
   ![image](https://user-images.githubusercontent.com/35389283/209605342-45761177-8eb6-4880-b1a1-8fe2b984c6bb.png)
   An inspection of the data quality execution process shows that the code does not filter the data in the read phase but loads all of it into memory. When the data volume is huge (in the billions), the task cannot run even after adjusting the Spark tuning parameters.
   ![image](https://user-images.githubusercontent.com/35389283/209605634-2e39ce49-b597-4db3-b063-aa67cea50a52.png)
   
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13282: [Improvement][dolphinscheduler-data-quality] Data quality optimization of large amount of data

Posted by github-actions.
github-actions[bot] commented on issue #13282:
URL: https://github.com/apache/dolphinscheduler/issues/13282#issuecomment-1407214053

   This issue has been automatically marked as stale because it has had no activity for 30 days. It will be closed in the next 7 days if no further activity occurs.




[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13282: [Improvement][dolphinscheduler-data-quality] Data quality optimization of large amount of data

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #13282:
URL: https://github.com/apache/dolphinscheduler/issues/13282#issuecomment-1416552514

   This issue has been closed because it has not received a response for too long. You can reopen it if you encounter similar problems in the future.




[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #13282: [Improvement][dolphinscheduler-data-quality] Data quality optimization of large amount of data

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #13282:
URL: https://github.com/apache/dolphinscheduler/issues/13282#issuecomment-1365580238

   Thank you for your feedback. We have received your issue; please wait patiently for a reply.
   * So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
   * If you haven't received a reply for a long time, you can [join our slack](https://s.apache.org/dolphinscheduler-slack) and send your question to the channel `#troubleshooting`




[GitHub] [dolphinscheduler] duspring commented on issue #13282: [Improvement][dolphinscheduler-data-quality] Data quality optimization of large amount of data

Posted by GitBox <gi...@apache.org>.
duspring commented on issue #13282:
URL: https://github.com/apache/dolphinscheduler/issues/13282#issuecomment-1365685905

   How soon can the data quality module add fetchSize and SQL filtering conditions to limit the amount of data on the reader side? I originally assumed that the filtering was done at the source; as it stands, the module is hard to use in production.
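
As a rough illustration of the request above, and not the data quality module's actual implementation, the sketch below builds an option map for Spark's JDBC data source so that the filter runs in the database and rows are fetched in batches. The option names `url`, `dbtable`, and `fetchsize` are Spark's JDBC options; the helper `build_jdbc_reader_options`, the alias `dq_src`, and the URL, table, and filter values are hypothetical. Whether `fetchsize` is honored depends on the JDBC driver in use.

```python
from typing import Dict, Optional


def build_jdbc_reader_options(
    url: str,
    table: str,
    filter_sql: Optional[str] = None,
    fetch_size: int = 10000,
) -> Dict[str, str]:
    """Build options for Spark's JDBC source (hypothetical helper).

    filter_sql is assumed to come from trusted task configuration,
    since it is concatenated into the query text as-is.
    """
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    if filter_sql:
        # Wrap the filter in a subquery so the database applies it
        # before any rows are sent to Spark.
        dbtable = f"(SELECT * FROM {table} WHERE {filter_sql}) dq_src"
    else:
        dbtable = table
    return {
        "url": url,
        "dbtable": dbtable,
        # Rows fetched per round trip; Spark option values are strings.
        "fetchsize": str(fetch_size),
    }


# Hypothetical connection details, for illustration only.
options = build_jdbc_reader_options(
    url="jdbc:postgresql://db-host:5432/demo",
    table="orders",
    filter_sql="dt = '2022-12-26'",
    fetch_size=5000,
)

# With a SparkSession named `spark`, the options would be used like this:
# df = spark.read.format("jdbc").options(**options).load()
```

With this shape, only the rows matching the filter leave the database, which is the source-side filtering asked for here.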




[GitHub] [dolphinscheduler] Tianqi-Dotes commented on issue #13282: [Improvement][dolphinscheduler-data-quality] Data quality optimization of large amount of data

Posted by GitBox <gi...@apache.org>.
Tianqi-Dotes commented on issue #13282:
URL: https://github.com/apache/dolphinscheduler/issues/13282#issuecomment-1365634502

   The JDBC reader needs two additional options: fetchSize and an SQL filter.




[GitHub] [dolphinscheduler] github-actions[bot] closed issue #13282: [Improvement][dolphinscheduler-data-quality] Data quality optimization of large amount of data

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #13282: [Improvement][dolphinscheduler-data-quality] Data quality optimization of large amount of data
URL: https://github.com/apache/dolphinscheduler/issues/13282

