Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/05/07 07:22:03 UTC
[GitHub] [incubator-doris] spaces-X opened a new pull request, #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load
spaces-X opened a new pull request, #9436:
URL: https://github.com/apache/incubator-doris/pull/9436
# Proposed changes
The `row_number()` function in **Spark** returns **a 32-bit integer value**.
This causes two problems in Spark Load.
**case 1: loading a large amount of data at one time causes `row_number()` to overflow.**
When the cardinality of the columns to be encoded in the **data imported at one time** exceeds **about 2.1 billion** (`Integer.MAX_VALUE`), `row_number()` returns a negative number.
**case 2: loading data over many batches causes the maximum dict_value in the global dictionary to exceed the Integer range, but it is not cast to bigint.**
---
Case 1 is, I think, a design flaw that creates a bottleneck for one-time loading. It affects relatively few scenarios and can be worked around in the short term by importing in multiple batches.
Case 2 is fixed by this PR.
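To make the two cases concrete, here is a minimal plain-Java sketch of the arithmetic involved. It does not use Spark, and the class and method names are hypothetical, not taken from the Doris code base. For case 2 it assumes the new dict value is computed as "current maximum + row number" in 32-bit arithmetic, which is one plausible way the negative value can arise.

```java
// Hypothetical illustration only; none of these names exist in Doris.
final class DictValueOverflowDemo {

    // Case 1: a 32-bit counter wraps to a negative number once it passes
    // Integer.MAX_VALUE (2,147,483,647).
    static int nextRowNumberAsInt(int current) {
        return current + 1; // silently wraps on overflow
    }

    // Case 2 (assumed buggy form): the offset is added in int arithmetic,
    // so the sum wraps even though each operand fits in an int.
    static int encodeWithIntMath(int maxDictValue, int rowNumber) {
        return maxDictValue + rowNumber;
    }

    // Case 2 (widened form): convert to long before adding, which is the
    // plain-Java analogue of casting to bigint.
    static long encodeWithLongMath(long maxDictValue, int rowNumber) {
        return maxDictValue + (long) rowNumber;
    }

    public static void main(String[] args) {
        System.out.println(nextRowNumberAsInt(Integer.MAX_VALUE));           // -2147483648
        System.out.println(encodeWithIntMath(2_000_000_000, 500_000_000));   // -1794967296
        System.out.println(encodeWithLongMath(2_000_000_000L, 500_000_000)); // 2500000000
    }
}
```

Widening only helps case 2: in case 1 the row number itself has already wrapped before any cast is applied.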
## Problem Summary:
Describe the overview of changes.
## Checklist(Required)
1. Does it affect the original behavior: (Yes/No/I Don't know)
2. Has unit tests been added: (Yes/No/No Need)
3. Has document been added or modified: (Yes/No/No Need)
4. Does it need to update dependencies: (Yes/No)
5. Are there any changes that cannot be rolled back: (Yes/No)
## Further comments
If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [incubator-doris] spaces-X commented on pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load
Posted by GitBox <gi...@apache.org>.
spaces-X commented on PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436#issuecomment-1120156128
@wangshuo128 Is there a window function like **row_num** in Spark whose return type is **Long**?
[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436#issuecomment-1129559003
PR approved by anyone and no changes requested.
[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436#issuecomment-1129558992
PR approved by at least one committer and no changes requested.
[GitHub] [incubator-doris] yiguolei merged pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load
Posted by GitBox <gi...@apache.org>.
yiguolei merged PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436
[GitHub] [incubator-doris] wangshuo128 commented on pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load
Posted by GitBox <gi...@apache.org>.
wangshuo128 commented on PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436#issuecomment-1120162257
> @wangshuo128 Is there a window function like **row_num** in Spark whose return type is **Long**?
The return type of Spark's built-in `row_number` function is `Integer`; see
https://github.com/apache/spark/blob/2349f74866ae1b365b5e4e0ec8a58c4f7f06885c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L564
If you need a `row_number` function that returns a long, you could implement a UDAF yourself.
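As a rough sketch of what a long-typed row number would buy, the plain-Java snippet below numbers a sorted list of distinct values with a `long` counter and offsets it by the current dictionary maximum. It is not a Spark UDAF and not the approach taken in this PR; `LongRowNumberSketch` and `assignCodes` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Spark or Doris code.
final class LongRowNumberSketch {

    // Assigns consecutive long codes to values in iteration order,
    // starting right after the current maximum dictionary value.
    static Map<String, Long> assignCodes(List<String> sortedDistinctValues, long maxDictValue) {
        Map<String, Long> codes = new LinkedHashMap<>();
        long rowNumber = 0L; // long counter, so it cannot wrap at 2^31 - 1
        for (String value : sortedDistinctValues) {
            rowNumber++;
            codes.put(value, maxDictValue + rowNumber);
        }
        return codes;
    }

    public static void main(String[] args) {
        Map<String, Long> codes = assignCodes(Arrays.asList("a", "b", "c"), 2_147_483_646L);
        System.out.println(codes); // {a=2147483647, b=2147483648, c=2147483649}
    }
}
```

The codes cross `Integer.MAX_VALUE` without turning negative because both the counter and the sum are kept in `long`.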