Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2021/11/29 13:38:00 UTC
[jira] [Created] (BEAM-13335) DataFrame sources produce excessively large index
Brian Hulette created BEAM-13335:
------------------------------------
Summary: DataFrame sources produce excessively large index
Key: BEAM-13335
URL: https://issues.apache.org/jira/browse/BEAM-13335
Project: Beam
Issue Type: Improvement
Components: dsl-dataframe
Reporter: Brian Hulette
Assignee: Robert Bradshaw
DataFrame reads attempt to match user expectations by giving every element, across
all shards, a unique index. Currently this is done by embedding the file path
itself in the index, so the (often quite long) path is duplicated for every
element. As a result, the index is sometimes larger than the data itself.
We should instead generate a guaranteed unique _numeric_ index.
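
To illustrate the size difference, here is a minimal sketch in plain pandas. It does not use or reproduce Beam's actual read implementation. The helper names (path_based_index, numeric_index), the "<path>:<row>" string layout, the example bucket paths, and the ROW_BITS constant are all assumptions made for illustration. The numeric scheme shown, packing a shard number into the high bits and the row offset into the low bits, is only one possible way to get a unique numeric index and is not necessarily what the fix will use.

```python
import pandas as pd

# Assumption for this sketch: no shard has more than 2**40 rows.
ROW_BITS = 40


def path_based_index(path, n_rows):
    """Hypothetical index that repeats the file path for every row."""
    return pd.Index([f"{path}:{row}" for row in range(n_rows)])


def numeric_index(shard_number, n_rows):
    """Hypothetical numeric index: shard number in the high bits,
    row offset within the shard in the low bits."""
    if n_rows > (1 << ROW_BITS):
        raise ValueError("shard has too many rows for ROW_BITS")
    start = shard_number << ROW_BITS
    return pd.RangeIndex(start, start + n_rows)


# Three made-up shards of 1000 rows each.
paths = [
    f"gs://some-bucket/some/long/prefix/part-{i:05d}-of-00003.csv"
    for i in range(3)
]
rows_per_shard = 1000

by_path = path_based_index(paths[0], rows_per_shard).append(
    [path_based_index(p, rows_per_shard) for p in paths[1:]]
)
by_number = numeric_index(0, rows_per_shard).append(
    [numeric_index(i, rows_per_shard) for i in range(1, len(paths))]
)

# Compare the memory footprint of the two indexes (in bytes).
path_bytes = by_path.memory_usage(deep=True)
number_bytes = by_number.memory_usage(deep=True)
print("path-based index bytes:", path_bytes)
print("numeric index bytes:   ", number_bytes)
print("both unique:", by_path.is_unique and by_number.is_unique)
```

Both indexes are unique across shards, but the numeric one costs a fixed 8 bytes per element regardless of path length. The trade-off in this particular scheme is that each shard needs a stable shard number and a bound on its row count.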
--
This message was sent by Atlassian Jira
(v8.20.1#820001)