Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2021/11/29 13:38:00 UTC

[jira] [Created] (BEAM-13335) DataFrame sources produce excessively large index

Brian Hulette created BEAM-13335:
------------------------------------

             Summary: DataFrame sources produce excessively large index
                 Key: BEAM-13335
                 URL: https://issues.apache.org/jira/browse/BEAM-13335
             Project: Beam
          Issue Type: Improvement
          Components: dsl-dataframe
            Reporter: Brian Hulette
            Assignee: Robert Bradshaw


DataFrame reads attempt to match user expectations by giving every element
across all shards a unique index. This is done by embedding the file path
itself in the index, so the (often quite long) path is duplicated for
every element. The index can sometimes end up larger than the data itself.

We should instead generate a _numeric_ index that is guaranteed to be unique.
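
A numeric index of this kind could, for instance, pack a shard number and a
per-shard row offset into a single 64-bit integer. The snippet below is only
a rough sketch of that idea, not Beam's implementation: the names
numeric_index and index_shard, the 40-bit row width, and the assumption that
each shard can be assigned a small integer id are all hypothetical.

```python
from typing import List

# Hypothetical layout, not Beam's API: the bit widths are assumptions.
ROW_BITS = 40                # room for 2**40 rows per shard
SHARD_BITS = 63 - ROW_BITS   # keeps every value inside a signed int64


def numeric_index(shard_id: int, row_offset: int) -> int:
    """Pack (shard_id, row_offset) into one non-negative integer."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError(f"shard_id out of range: {shard_id}")
    if not 0 <= row_offset < (1 << ROW_BITS):
        raise ValueError(f"row_offset out of range: {row_offset}")
    # High bits identify the shard, low bits the row within that shard.
    return (shard_id << ROW_BITS) | row_offset


def index_shard(shard_id: int, num_rows: int) -> List[int]:
    """Index values for every row of a single shard."""
    return [numeric_index(shard_id, i) for i in range(num_rows)]
```

With this layout each element costs a fixed 8 bytes of index no matter how
long the source path is, and values from different shards cannot collide as
long as the shard ids themselves are distinct.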



--
This message was sent by Atlassian Jira
(v8.20.1#820001)