You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "randolf-scholz (via GitHub)" <gi...@apache.org> on 2023/04/18 15:24:11 UTC

[GitHub] [arrow] randolf-scholz opened a new pull request, #35210: added dictionary[int32,string]

randolf-scholz opened a new pull request, #35210:
URL: https://github.com/apache/arrow/pull/35210

   Fixes #35209


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] randolf-scholz commented on a diff in pull request #35210: GH-35209: [Python] Add a type alias for pa.dictionary(pa.int32(), pa.string())

Posted by "randolf-scholz (via GitHub)" <gi...@apache.org>.
randolf-scholz commented on code in PR #35210:
URL: https://github.com/apache/arrow/pull/35210#discussion_r1170623666


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -952,6 +952,23 @@ def test_decimal_point(self):
             'b': [2.5, -3.0, None],
         }
 
+    def test_dictionary(self):

Review Comment:
   Btw., reading my own comment I'm wondering whether `dictionary[int32, string]` with space after the comma makes more sense than `dictionary[int32,string]`. Alternatively one could consider adding code that removes all whitespaces from the string and just have both of them work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] randolf-scholz commented on a diff in pull request #35210: GH-35209: [Python] Add a type alias for pa.dictionary(pa.int32(), pa.string())

Posted by "randolf-scholz (via GitHub)" <gi...@apache.org>.
randolf-scholz commented on code in PR #35210:
URL: https://github.com/apache/arrow/pull/35210#discussion_r1170620315


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -952,6 +952,23 @@ def test_decimal_point(self):
             'b': [2.5, -3.0, None],
         }
 
+    def test_dictionary(self):

Review Comment:
   > This is the first composite type alias added to this list. I don't see any issue in adding it, but also it leaves the user wondering why only a subset is implemented.
   
   I wrote about this in the issue, please correct me if I am wrong, but it seems that `dictionary[int32, string]` is the "standard" dictionary type that has some useful kernels. At least I noticed horrible performance degradation when trying to read CSV data with `int16`, `uint32` or `int64` indexed dictionary types. So adding only `dictionary[int32, string]` would be to protect the user from using unoptimized composite types. (If this is a bug, I can open another issue).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] randolf-scholz commented on pull request #35210: GH-35209: [Python] Add a type alias for pa.dictionary(pa.int32(), pa.string())

Posted by "randolf-scholz (via GitHub)" <gi...@apache.org>.
randolf-scholz commented on PR #35210:
URL: https://github.com/apache/arrow/pull/35210#issuecomment-1546589412

   @danepitkin I'd add `dictionary[string,int32]` since it is likely the most common one. Maybe one should add `dictionary[int8,string]`, `dictionary[int16,string]` and `dictionary[int64,string]` immediately as well? How common are non-string dictionaries anyway?
   
   If all possible combinations with non-string types are needed one could write a custom lookup instead. 
   
   Regarding the representation one could do either:
   
   - differentiate `DictionaryType.__repr__` and `DictionaryType.__str__`, so that the latter returns the type-code
   - add an additional `name` / `type_code` property to each dtype that returns the abbreviated type code
       - for example, numpy has `name` attribute
       - `np.dtype("int32").name` ⇝ "int32"
       - `np.dtype("datetime64[ns]").name` ⇝ "datetime64[ns]"
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35210: GH-35209: [Python] Add a type alias for pa.dictionary(pa.int32(), pa.string())

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35210:
URL: https://github.com/apache/arrow/pull/35210#discussion_r1170550875


##########
python/pyarrow/tests/test_csv.py:
##########
@@ -952,6 +952,23 @@ def test_decimal_point(self):
             'b': [2.5, -3.0, None],
         }
 
+    def test_dictionary(self):

Review Comment:
   Can you add to the test case `test_schema.py::test_type_for_alias()`? IMO you can delete this test afterwards.



##########
python/pyarrow/types.pxi:
##########
@@ -4790,6 +4790,7 @@ cdef dict _type_aliases = {
     'duration[us]': duration('us'),
     'duration[ns]': duration('ns'),
     'month_day_nano_interval': month_day_nano_interval(),
+    'dictionary[int32,string]': dictionary(int32(), utf8()),

Review Comment:
   ```suggestion
   
       # Composite types
       'dictionary[int32,string]': dictionary(int32(), utf8()),
   ```



##########
python/pyarrow/types.pxi:
##########
@@ -4790,6 +4790,7 @@ cdef dict _type_aliases = {
     'duration[us]': duration('us'),
     'duration[ns]': duration('ns'),
     'month_day_nano_interval': month_day_nano_interval(),
+    'dictionary[int32,string]': dictionary(int32(), utf8()),

Review Comment:
   I wonder if it makes sense to separate composite types from Arrow's primitive types into a different list? Theoretically we could auto generate them for completeness, although I wouldn't want to introduce unnecessary computation on startup. We could lazy-load composite types on usage. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] randolf-scholz closed pull request #35210: added dictionary[int32,string]

Posted by "randolf-scholz (via GitHub)" <gi...@apache.org>.
randolf-scholz closed pull request #35210: added dictionary[int32,string]
URL: https://github.com/apache/arrow/pull/35210


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35210: GH-35209: [Python] Add a type alias for pa.dictionary(pa.int32(), pa.string())

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35210:
URL: https://github.com/apache/arrow/pull/35210#issuecomment-1513626314

   :warning: GitHub issue #35209 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35210: added dictionary[int32,string]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35210:
URL: https://github.com/apache/arrow/pull/35210#issuecomment-1513357895

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35210: GH-35209: [Python] Add a type alias for pa.dictionary(pa.int32(), pa.string())

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35210:
URL: https://github.com/apache/arrow/pull/35210#issuecomment-1513626265

   * Closes: #35209


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35210: GH-35209: [Python] Add a type alias for pa.dictionary(pa.int32(), pa.string())

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35210:
URL: https://github.com/apache/arrow/pull/35210#issuecomment-1546114236

   Sorry to go back on this, but after some more thought I'm a bit hesitant to add composite type string aliases. There's hundreds of possibilities (or more) and there can also be several variations of string representations each (e.g. spaces vs no spaces). We also aren't matching the string representation of `pa.dictionary.__str__` here, which looks something like this: `DictionaryType(dictionary<values=string, indices=int32, ordered=0>)`. 
   
   @jorisvandenbossche @AlenkaF any thoughts here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on pull request #35210: GH-35209: [Python] Add a type alias for pa.dictionary(pa.int32(), pa.string())

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on PR #35210:
URL: https://github.com/apache/arrow/pull/35210#issuecomment-1547808438

   I have to agree with Dane. Adding only one composite type plus only a few options for the selected type is not ideal. Also the various ways one could decide to have the alias defined is not helping. I do not have a good suggestion on how make it useful though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org