You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "rtyler (via GitHub)" <gi...@apache.org> on 2023/03/26 03:42:12 UTC

[GitHub] [arrow-rs] rtyler opened a new issue, #3949: Consider renaming rather than removing Decoder

rtyler opened a new issue, #3949:
URL: https://github.com/apache/arrow-rs/issues/3949

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   I have found that creating `serde_json::Value` objects is a *lot* easier than trying to construct `RecordBatch` objects. This is especially useful when using serde's deserialization code for data ingestion. Crates like [serde_arrow}(https://crates.io/crates/serde_arrow) are out-dated and add unnecessary complexity
   
   I have taken to using Decoder to convert Value objects to RecordBatchs easily.
   
   **Describe the solution you'd like**
   
   Renaming or moving the Decoder to where it's not considered for use on deserializing raw JSON buffers, e.g. the BufReader approach that it uses now, but rather can be used for naturally converting pre-deserialized JSON `Value` objects.
   
   **Describe alternatives you've considered**
   
   If Decoder gets bounced out of the repo, I would probably just refactor it into its own crate and push that :smile: 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #3949: Consider renaming rather than removing Decoder

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #3949:
URL: https://github.com/apache/arrow-rs/issues/3949#issuecomment-1484077983

   The major reason for wanting to remove the old decoder, aside from its incredibly underwhelming performance, is that the code is very hard to maintain and reason about, in particular the way it handles nested types is very convoluted, and is likely also incorrect.
   
   That being said I wonder if we can accommodate your use-case in a different manner, you mention "using serde's deserialization code", does this mean that the types are known statically ahead of time? I ask as providing a nicer story for creating `RecordBatch` from statically typed rows is something I have been playing around with, would this work for your use-case? 
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold closed issue #3949: Consider renaming rather than removing Decoder

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold closed issue #3949: Consider renaming rather than removing Decoder
URL: https://github.com/apache/arrow-rs/issues/3949


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] trueleo commented on issue #3949: Consider renaming rather than removing Decoder

Posted by "trueleo (via GitHub)" <gi...@apache.org>.

trueleo commented on issue #3949:
URL: https://github.com/apache/arrow-rs/issues/3949#issuecomment-1488337343

   > the code is very hard to maintain and reason about, in particular the way it handles nested types is very convoluted, and is likely also incorrect.
   
   I understand that this can be a PITA to maintain but i also agree with what @rtyler said 
   
   > creating serde_json::Value objects is a lot easier than trying to construct RecordBatch objects.
   
   At Parseable we want to support basic schema evolution on top level. We flatten JSON before converting it into record batch. To do that we need to convert it to serde_json::Value ( it'll be amazing to have a flatten algorithm that does this at very low level but currently we don't have that ). 
   
   So for us it is hard to move away from `Value` -> `RecordBatch` flow. If maintainers at arrow-rs feel strongly about deprecating this then we will probably end up maintaining Decode ourselves.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] rtyler commented on issue #3949: Consider renaming rather than removing Decoder

Posted by "rtyler (via GitHub)" <gi...@apache.org>.

rtyler commented on issue #3949:
URL: https://github.com/apache/arrow-rs/issues/3949#issuecomment-1484170281

   > the code is very hard to maintain and reason about, in particular the way it handles nested types is very convolute
   
   Yes I would agree with that assessment :laughing: 
   
   > creating `RecordBatch` from statically typed rows is something I have been playing around with, would this work for your use-case, or are the types not known at compile time?
   
   Yes the data is more or less known ahead of time. The work that I am doing is similar to what [kafka-delta-ingest](https://github.com/delta-io/kafka-delta-ingest) does, which is converting JSON data into Delta formatted data. In my cases I want to deserialize into structs with Serde in order to perform some operations with the data. If I have a path to re-use serde's `Value` type which is shared between `serde_json` and `serde_yaml` (and others) then I can use that as my "intermediary" format for deserializing known typed data _before_ converting that into a `RecordBatch` for writing out to Delta.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #3949: Consider renaming rather than removing Decoder

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #3949:
URL: https://github.com/apache/arrow-rs/issues/3949#issuecomment-1500235697

   `label_issue.py` automatically added labels {'arrow'} from #3979


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #3949: Consider renaming rather than removing Decoder

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #3949:
URL: https://github.com/apache/arrow-rs/issues/3949#issuecomment-1488344012

   > it'll be amazing to have a flatten algorithm that does this at very low level but currently we don't have that
   
   Perhaps you could file a ticket for this, flattening structs at least is trivial. I'm not sure how one flattens lists though...
   
   > I am curious to check how fast RawDecoder is compared to older Decoder
   
   It's at least twice as fast, but the JSON writer is very inefficient currently and so is likely to eat into that severely


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #3949: Consider renaming rather than removing Decoder

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #3949:
URL: https://github.com/apache/arrow-rs/issues/3949#issuecomment-1490042002

   https://github.com/apache/arrow-rs/pull/3979 contains a POC of how we could support this going forward, PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org