Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2020/03/14 09:26:15 UTC

[GitHub] [drill] vvysotskyi commented on issue #2026: DRILL-7330: Implement metadata usage for all format plugins

vvysotskyi commented on issue #2026: DRILL-7330: Implement metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#issuecomment-599032547
 
 
   @paul-rogers, this pull request enables format plugins to gather metadata. The metadata gathering logic itself was added in DRILL-7273.
   
   Regarding the schema: when metadata is being collected, the rules are the same as for regular select queries - Drill tries to infer the table schema or uses the user-provided schema.
   
   The metadata collection logic may become clearer after reading this section of the docs: https://github.com/apache/drill/blob/master/docs/dev/MetastoreAnalyze.md#analyze-operators-description or this design doc: https://docs.google.com/document/d/14pSIzKqDltjLEEpEebwmKnsDPxyS_6jGrPOjXu6M_NM/edit?usp=sharing
   In short, yes, we use a reader that reads all the data, and downstream operators transform and store its statistics.
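   
   For illustration, a minimal sketch of how metadata collection is triggered (the workspace and table name below are hypothetical; `metastore.enabled` and the `ANALYZE TABLE ... REFRESH METADATA` syntax are described in the MetastoreAnalyze doc linked above):
   
   ```sql
   -- enable the Drill Metastore for the session
   SET `metastore.enabled` = true;
   
   -- collect table metadata (schema, row counts, column min/max, nulls count, ...)
   ANALYZE TABLE dfs.tmp.`sales` REFRESH METADATA;
   
   -- or limit the set of columns for which column-level metadata is collected
   ANALYZE TABLE dfs.tmp.`sales` COLUMNS(region, amount) REFRESH METADATA;
   ```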
   
   > For files that need a provided schema (CSV, say), do we apply stats to the columns after type conversion, or are stats gathered on the raw text values? That is, does this work use the provided schema if available?
   
   Yes, we apply stats to the columns after schema conversion, so stats such as min/max have correct values with respect to the columns' natural ordering.
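   
   For example (hypothetical table and columns; `CREATE SCHEMA` is the existing provided-schema command):
   
   ```sql
   -- provide a schema so the CSV text columns are converted to typed values
   CREATE OR REPLACE SCHEMA (`order_date` DATE, `amount` DOUBLE)
   FOR TABLE dfs.tmp.`orders_csv`;
   
   -- metadata is then collected on the converted DATE/DOUBLE values, so min/max
   -- reflect the types' natural ordering rather than string ordering
   ANALYZE TABLE dfs.tmp.`orders_csv` REFRESH METADATA;
   ```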
   
   > How does the provided schema relate to the metadata schema?
   
   After the provided schema is used in the scan, Drill uses the resolved column schema and stores it in the Metastore.
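   
   For instance, after the analysis above, the resolved column types should be visible through `INFORMATION_SCHEMA` (assuming the table's metadata is present in the Metastore; the table name is the hypothetical one from the previous example):
   
   ```sql
   -- inspect the column schema Drill resolved and stored for the analyzed table
   SELECT COLUMN_NAME, DATA_TYPE
   FROM INFORMATION_SCHEMA.`COLUMNS`
   WHERE TABLE_NAME = 'orders_csv';
   ```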
   
   > What stats will we gather for non-Parquet files? How will we use them? Looks like there is code for partitions (have not looked in depth, so I may be wrong). Are we using stats for partition pruning? If so, how does that differ from the existing practice of just walking the directory tree?
   
   We collect exactly the same stats for non-Parquet files, and they may be used in the same way as for Parquet: pruning files when a filter on specific columns is specified, and pruning unneeded files for limit queries. Directory pruning still works the same way as it did before these changes (it also works for Parquet); see the sketch below.
   I think some tests in `TestMetastoreWithEasyFormatPlugin` will help to understand which optimizations are added.
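   
   A rough sketch of the kinds of queries that benefit (hypothetical table, columns, and directory layout):
   
   ```sql
   -- with min/max metadata in the Metastore, the planner can skip files whose
   -- `amount` range cannot satisfy the filter
   EXPLAIN PLAN FOR
   SELECT * FROM dfs.tmp.`orders_csv` WHERE amount > 1000;
   
   -- directory pruning on dir0 keeps working as before, for Parquet and non-Parquet alike
   SELECT * FROM dfs.tmp.`orders_csv` WHERE dir0 = '2020';
   ```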
   
   > Do you see any potential conflicts between your metadata model and the above provided schema model?
   
   Looks like there shouldn't be any conflicts.
