Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/05/07 07:22:03 UTC

[GitHub] [incubator-doris] spaces-X opened a new pull request, #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load

spaces-X opened a new pull request, #9436:
URL: https://github.com/apache/incubator-doris/pull/9436

   
   # Proposed changes
   
   Spark's `row_number()` window function returns a 32-bit **integer** (`IntegerType`) value.
   
   This causes two problems in Spark Load.
   
   **Case 1: loading a large amount of data in a single load overflows `row_number()`.**
   
   When the cardinality of the columns to be encoded in **a single load** exceeds the 32-bit integer maximum (2,147,483,647, roughly **2.1 billion**), `row_number()` returns a negative number.
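
The wrap-around described in case 1 can be reproduced without Spark. The following is a minimal Python sketch, assuming the row number is held in a signed 32-bit integer that wraps on overflow, as the description above implies. `to_int32` is a hypothetical helper that emulates that storage; it is not a Spark or Doris function.

```python
import ctypes

# Largest value a signed 32-bit integer can hold (the "2.1 billion" limit).
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Emulate storing `value` in a signed 32-bit integer (two's-complement wrap).

    Hypothetical helper for illustration only.
    """
    return ctypes.c_int32(value).value


# The last row number that still fits, and the first one that does not.
last_valid = to_int32(INT32_MAX)
first_overflowed = to_int32(INT32_MAX + 1)

print(last_valid, first_overflowed)
```

The second value is negative, which is how a dictionary value below zero could appear once a single load has more distinct keys than a 32-bit integer can count.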
   
   
   
   **Case 2: loading data over many loads pushes the maximum dict_value in the global dictionary past the Integer range, but the value is not cast to bigint.**
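
Case 2 can be illustrated the same way. This sketch is not the code changed by the PR; it assumes, for illustration, that a new dictionary value is computed as the existing maximum dict_value plus the row number of the new key. The figures and the helpers `add_int32` and `add_int64` are hypothetical.

```python
import ctypes


def add_int32(a: int, b: int) -> int:
    """Add two values as if the result were stored in a signed 32-bit integer."""
    return ctypes.c_int32(a + b).value


def add_int64(a: int, b: int) -> int:
    """Add two values as if the result were stored in a signed 64-bit integer (bigint)."""
    return ctypes.c_int64(a + b).value


# Hypothetical state: earlier loads already pushed the dictionary close to the limit.
existing_max = 2_000_000_000
# Hypothetical row number of a new key in the current load; it fits in 32 bits by itself.
row_number = 200_000_000

narrow = add_int32(existing_max, row_number)  # sum kept as a 32-bit integer
wide = add_int64(existing_max, row_number)    # sum widened to bigint first

print(narrow, wide)
```

Neither operand overflows on its own, yet the 32-bit sum is negative, while the 64-bit sum keeps the expected value. That is the reason for widening to bigint before combining the two.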
   
   ---
   
   I think case 1 is a design flaw that limits how much data a single load can handle. It arises in relatively few scenarios, and in the short term it can be worked around by importing in multiple batches.
   
   Case 2 is fixed by this PR.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] spaces-X commented on pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load

Posted by GitBox <gi...@apache.org>.
spaces-X commented on PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436#issuecomment-1120156128

   @wangshuo128 Is there a window function in Spark like **row_number** whose return type is **Long**?




[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436#issuecomment-1129559003

   PR approved by anyone and no changes requested.




[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436#issuecomment-1129558992

   PR approved by at least one committer and no changes requested.




[GitHub] [incubator-doris] yiguolei merged pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load

Posted by GitBox <gi...@apache.org>.
yiguolei merged PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436




[GitHub] [incubator-doris] wangshuo128 commented on pull request #9436: [Spark Load]fix min_value of GlobalDict will be negative number in spark load

Posted by GitBox <gi...@apache.org>.
wangshuo128 commented on PR #9436:
URL: https://github.com/apache/incubator-doris/pull/9436#issuecomment-1120162257

   > @wangshuo128 Is there a window function in Spark like **row_number** whose return type is **Long**?
   
   The return type of Spark's built-in `row_number` function is `Integer`; see
   https://github.com/apache/spark/blob/2349f74866ae1b365b5e4e0ec8a58c4f7f06885c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L564
   
   If you need a `row_number` function that returns a long, you could implement a UDAF yourself.
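
One way to picture a row number that is not tied to a single 32-bit counter is to number rows from per-partition offsets. The sketch below is plain Python, not a Spark UDAF and not the approach taken in this PR; `index_partitions` and the sample data are hypothetical. Python integers do not overflow, so in Spark the offsets and indexes would have to be long values for this to help.

```python
from itertools import accumulate
from typing import List, Tuple


def index_partitions(partitions: List[List[str]]) -> List[Tuple[str, int]]:
    """Assign each element a global 1-based index using per-partition offsets.

    Hypothetical helper: each inner list stands in for one data partition.
    """
    sizes = [len(p) for p in partitions]
    # offsets[i] is the number of elements in all partitions before partition i.
    offsets = [0] + list(accumulate(sizes))[:-1]

    indexed = []
    for offset, partition in zip(offsets, partitions):
        # Number rows locally, then shift by the partition's offset.
        for local_pos, value in enumerate(partition, start=1):
            indexed.append((value, offset + local_pos))
    return indexed


# Hypothetical distinct keys spread over four partitions, one of them empty.
sample = [["a", "b"], ["c"], [], ["d", "e"]]
result = index_partitions(sample)
print(result)
```

Spark's RDD method `zipWithIndex` works along similar lines and yields `Long` indexes in the Scala API, though whether it suits the global dictionary build is a separate question.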

