Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/11/08 13:05:02 UTC

[GitHub] [drill] cgivre commented on pull request #2282: DRILL-7978: Fixed Width Format Plugin

cgivre commented on pull request #2282:
URL: https://github.com/apache/drill/pull/2282#issuecomment-963130138


   > @MFoss19 @estherbuchwalter following some [recent chat](https://github.com/apache/drill/pull/2359#issuecomment-962673076) with @paul-rogers and my last comment here, how about a reduced format config such as the following? The goal is to get to something terse and consistent with what we do for other text formats.
   > 
   > ```json
   > "fixedwidth": {
   >   "type": "fixedwidth",
   >   "extensions": [
   >     "fwf"
   >   ],
   >   "extractHeader": true,
   >   "trimStrings": true,
   >   "columnOffsets": [1, 11, 21, 31],
   >   "columnWidths": [10, 10, 10, 10]
   > }
   > ```
   > 
   > Column names and types can already come from a provided schema or aliasing after calls to `CAST()`. Incidentally, the settings above can be overridden per query using a provided schema too.
   > 
   > There's also a part of me that wonders whether we could have justified adding our fixed width functionality to the existing delimited text format reader.
   
   @dzamo I'd respectfully disagree here.  In effect, the configuration is providing a schema to the user, similar to the way the logRegex reader works.  The user will get the best data possible if we can include data types and field names in the schema, so that they can just do a `SELECT *` and not have to worry about casting etc.
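
   To make the point concrete, here is a minimal Python sketch (not Drill code) of what a config carrying names and types buys the user. Everything in it is an assumption for illustration: the `Field` class, the `parse_line` helper, the field names, and the reading of the offsets as 1-based start columns, as the `columnOffsets` values in the quoted example suggest.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Field:
    """Hypothetical field descriptor; not Drill's actual config schema."""
    name: str
    offset: int  # assumed 1-based start column, like columnOffsets above
    width: int
    convert: Callable[[str], Any] = str  # type conversion applied to the slice


def parse_line(line: str, fields: list[Field], trim: bool = True) -> dict[str, Any]:
    """Slice one fixed-width line into a dict of named, typed values."""
    row = {}
    for f in fields:
        start = f.offset - 1  # convert the 1-based offset to a 0-based index
        raw = line[start:start + f.width]
        if trim:
            raw = raw.strip()
        row[f.name] = f.convert(raw)
    return row


# Illustrative field list: same offsets and widths as the quoted config,
# plus the names and types that a bare offsets/widths config leaves out.
FIELDS = [
    Field("event_id", 1, 10, int),
    Field("user", 11, 10),
    Field("duration", 21, 10, float),
    Field("status", 31, 10),
]

# Build a sample 40-character line with four left-aligned, padded columns.
line = f"{'42':<10}{'alice':<10}{'1.5':<10}{'OK':<10}"
row = parse_line(line, FIELDS)
print(row)
```

   With only offsets and widths, every value would come back as an unnamed string and the query would need a `CAST()` and an alias per column; with names and types in the config, a plain `SELECT *` could already return typed columns.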
   
   Let's consider a real-world use case: some fixed width log generated by a database.  Since the fields may be mashed together, there isn't a delimiter that you can use to divide the fields.  You *could*, however, use the logRegex reader to do this.  That point aside for the moment, the way I imagined someone using this was that different configs could be set up and linked to workspaces, so that a file in the `mysql_logs` folder would use the MySQL log config and a file in the `postgres` folder would use another.
   
   My opinion here is that the goal should be to get the cleanest possible data to the user, without the user having to rely on CASTs and other complicating factors.

