Posted to issues@arrow.apache.org by "hofst (via GitHub)" <gi...@apache.org> on 2024/03/03 06:47:16 UTC

[I] Parquet element_type is nullable despite being non-null [arrow]

hofst opened a new issue, #40324:
URL: https://github.com/apache/arrow/issues/40324

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I have a column where each element is an array of floats (vector embeddings). Those floats are never null. I need to persist this column in this exact format but pyarrow will always set the inner element_type to `optional`. FastParquet seems to have an option `has_nulls=False` for similar purposes but FastParquet does not seem to be able to write nested array types at all.
   
   This is the pyarrow schema type of the column:
   ```
   pyarrow.field('embedding', pyarrow.list_(pyarrow.float32(), list_size=1024), nullable=False)
   ```
   And this is the resulting Parquet schema:
   ```
     required group field_id=-1 embedding (List) {
       repeated group field_id=-1 list {
         optional float field_id=-1 element;
       }
     }
   ```
   
   No matter what I try, I cannot find a way to change the inner element's `optional` qualifier to `required`. I need this exact schema, and it seems problematic that pyarrow cannot create it even though the pyarrow documentation explicitly acknowledges that the inner element type may be either `required` or `optional`.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python][Parquet] element_type is nullable despite being non-null [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #40324:
URL: https://github.com/apache/arrow/issues/40324#issuecomment-1976450590

   Hmm, this can be a workaround 🤔 I'm not familiar with Python, but the list type is defined in C++ as below:
   
   ```c++
   class ARROW_EXPORT ListType : public BaseListType {
    public:
     static constexpr Type::type type_id = Type::LIST;
     using offset_type = int32_t;
   
     static constexpr const char* type_name() { return "list"; }
   
     // List can contain any other logical value type
     explicit ListType(const std::shared_ptr<DataType>& value_type)
         : ListType(std::make_shared<Field>("item", value_type)) {}
   ```
   
   It contains a `Field` here, so I think it's OK, but I agree it's a bit weird for users. I'm actually not familiar with a convenient way to create one in Python... Maybe others can help.




Re: [I] [Python][Parquet] element_type is nullable despite being non-null [arrow]

Posted by "hofst (via GitHub)" <gi...@apache.org>.
hofst commented on issue #40324:
URL: https://github.com/apache/arrow/issues/40324#issuecomment-1976433349

   @mapleFU Thanks, that seems to work. But it seems very strange to use a Field inside a list. Fields should be part of a struct, right? When I use them inside a list, I *have* to provide a field name which at the same time is ignored.




Re: [I] [Python][Parquet] element_type is nullable despite being non-null [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #40324:
URL: https://github.com/apache/arrow/issues/40324#issuecomment-1975553222

   `pyarrow.float32()`
   
   I think this should be set to non-nullable.

