Posted to issues@beam.apache.org by "Jeff Webb (Jira)" <ji...@apache.org> on 2021/09/14 20:25:00 UTC

[jira] [Updated] (BEAM-8734) Optimize the inference of element_type when writing a list of objects to FileBasedCache

     [ https://issues.apache.org/jira/browse/BEAM-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Webb updated BEAM-8734:
----------------------------
    Resolution: Fixed
        Status: Resolved  (was: Triage Needed)

The attached link suggests that this has been resolved.

> Optimize the inference of element_type when writing a list of objects to FileBasedCache
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-8734
>                 URL: https://issues.apache.org/jira/browse/BEAM-8734
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>    Affects Versions: 2.16.0
>            Reporter: Alexey Strokach
>            Priority: P3
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The proposed {{FileBasedCache.write}} method allows the user to write a list of arbitrary objects to a cache. The {{element_type}} and the appropriate {{coder}} for the list of objects are inferred using the {{apache_beam.testing.datatype_inference.infer_element_type}} function. This works well for lists that are small to moderate in size, but is likely to be very inefficient when the amount of data being written is large.
> Two approaches to solving this issue have been considered:
> 1. We could attempt to infer the {{element_type}} from the first N elements (e.g. the first 100 elements) of the provided list. This should produce the correct {{element_type}} for all elements in the list in the majority of cases (since every element in the list is likely to have the same data type). In the cases where the inferred {{element_type}} is incorrect, we could attempt to catch the resulting errors and infer the {{element_type}} again using a larger portion of the data.
> 2. If inferring the {{element_type}} in the first call to {{FileBasedCache.write}} takes too long, we could instruct the user to split the write into two calls: the first call provides a small but representative sample of the data, and the second call provides the rest. Since the {{element_type}} is inferred only the first time that anything is written to a cache, subsequent calls would not have the same constraint on the number of elements.
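
A minimal sketch of approach 1, assuming nothing about the real implementation. It does not use Beam: {{infer_type_stub}} is a hypothetical stand-in for {{infer_element_type}} that only reports the common Python type of the elements, and {{infer_with_fallback}}, {{sample_size}} and {{growth}} are illustrative names. The conformance check stands in for the errors that a real write would raise when an element does not match the inferred type.

```python
from typing import Any, Sequence, Tuple


def infer_type_stub(elements: Sequence[Any]) -> type:
    """Hypothetical stand-in for infer_element_type.

    Returns the single type shared by all elements, or `object`
    if the elements are of mixed types (or the sequence is empty).
    """
    types = {type(e) for e in elements}
    return types.pop() if len(types) == 1 else object


def infer_with_fallback(
    elements: Sequence[Any], sample_size: int = 100, growth: int = 10
) -> Tuple[type, int]:
    """Infer a type from a prefix, widening the prefix on mismatch.

    Returns the inferred type and the number of elements that had
    to be inspected to obtain it.
    """
    total = len(elements)
    size = min(sample_size, total)
    while True:
        element_type = infer_type_stub(elements[:size])
        # Stand-in for the write step: a real cache would fail while
        # encoding an element that does not match the inferred type.
        if all(isinstance(e, element_type) for e in elements):
            return element_type, size
        # The sample was not representative; retry with a larger
        # portion of the data. Once size == total, the stub falls
        # back to `object`, so the loop always terminates.
        size = min(size * growth, total)


# Homogeneous data: the first 100 elements are enough.
print(infer_with_fallback([1.0] * 500))
# One odd element past the sample forces a second, larger pass.
print(infer_with_fallback([1] * 150 + ["a"]))
```

In the homogeneous case only the sample is inspected during inference, which is the saving the description is after. The cost of the fallback path depends on how quickly the sample grows, so the growth factor here is only a placeholder.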



--
This message was sent by Atlassian Jira
(v8.3.4#803005)