Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/02/22 22:21:00 UTC

[jira] [Commented] (DRILL-8146) SAS reader fails to read the majority of sas files

    [ https://issues.apache.org/jira/browse/DRILL-8146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496357#comment-17496357 ] 

ASF GitHub Bot commented on DRILL-8146:
---------------------------------------

pseudomo opened a new pull request #2472:
URL: https://github.com/apache/drill/pull/2472


   # [DRILL-8146](https://issues.apache.org/jira/browse/DRILL-8146): SAS reader fails to read the majority of sas files
   
   ## Description
   
   Inferring the schema from the types of the values in the first row does not work well here, because a field in the first row can be null or the file can have no rows at all. I also see no point in using MinorType.BIGINT (Long), since SAS stores all numbers as Double anyway. As it turned out, SAS stores every value as either VARCHAR or DOUBLE.
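   
   To illustrate why a single Double mapping is enough, here is a minimal sketch. It assumes numeric cells arrive as boxed `Long` or `Double` objects, or as null for a missing value. `SasNumbers` and `toDouble` are hypothetical names for illustration only and are not part of this PR.
   
   ```java
   final class SasNumbers {
     // Widens any boxed numeric cell (Long, Integer, Double, ...) to Double.
     // A null cell stays null so that missing values do not cause an NPE.
     static Double toDouble(Object cell) {
       if (cell == null) {
         return null;
       }
       if (cell instanceof Number) {
         return ((Number) cell).doubleValue();
       }
       throw new IllegalArgumentException("Not a numeric cell: " + cell.getClass());
     }
   
     public static void main(String[] args) {
       System.out.println(toDouble(3L));   // a Long cell
       System.out.println(toDouble(3.5));  // a Double cell
       System.out.println(toDouble(null)); // a missing value
     }
   }
   ```
   
   Going through `Number` means a column that mixes Long and Double values needs no per-row type check.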
   
   My proposal is to use the SAS column type (VARCHAR/DOUBLE) together with the column format to determine the MinorType, so the first row is not needed at all.
   As you can see below, the dictionaries of all possible Date/Time formats that are already defined in the parso library let us distinguish between TIME, DATE and TIMESTAMP.
   ```
           if (DateTimeConstants.TIME_FORMAT_STRINGS.contains(columnFormat.getName())) {
             type = MinorType.TIME;
           } else if (DateTimeConstants.DATE_FORMAT_STRINGS.containsKey(columnFormat.getName())) {
             type = MinorType.DATE;
           } else if (DateTimeConstants.DATETIME_FORMAT_STRINGS.containsKey(columnFormat.getName())) {
             type = MinorType.TIMESTAMP;
           }
   ```
   All other fields, i.e. those without a Date/Time format, are read as either String or Double.
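   
   The whole decision can be sketched in a self-contained form as follows. This is an illustration under stated assumptions, not the code in this PR: the `MinorType` enum is a stand-in for Drill's type (with `FLOAT8` assumed to be the Double mapping), the three format sets are hypothetical, abbreviated stand-ins for the parso dictionaries, and `inferType` with its `isNumeric` flag is an invented helper.
   
   ```java
   import java.util.Set;
   
   // Stand-in for Drill's MinorType; only the values used in this sketch.
   enum MinorType { VARCHAR, FLOAT8, TIME, DATE, TIMESTAMP }
   
   final class SasTypeInference {
     // Hypothetical, abbreviated stand-ins for the parso format dictionaries.
     static final Set<String> TIME_FORMATS = Set.of("TIME", "HHMM");
     static final Set<String> DATE_FORMATS = Set.of("DATE", "MMDDYY", "YYMMDD");
     static final Set<String> DATETIME_FORMATS = Set.of("DATETIME");
   
     // isNumeric: true when SAS stores the column as a number,
     // false when it stores character data.
     static MinorType inferType(boolean isNumeric, String formatName) {
       if (!isNumeric) {
         return MinorType.VARCHAR;
       }
       // Set.of(...).contains(null) throws, so normalize a missing format name.
       String name = formatName == null ? "" : formatName;
       if (TIME_FORMATS.contains(name)) {
         return MinorType.TIME;
       } else if (DATE_FORMATS.contains(name)) {
         return MinorType.DATE;
       } else if (DATETIME_FORMATS.contains(name)) {
         return MinorType.TIMESTAMP;
       }
       // A numeric column without a Date/Time format is a plain Double.
       return MinorType.FLOAT8;
     }
   
     public static void main(String[] args) {
       System.out.println(inferType(true, "DATETIME")); // TIMESTAMP
       System.out.println(inferType(true, ""));         // FLOAT8
       System.out.println(inferType(false, ""));        // VARCHAR
     }
   }
   ```
   
   Because the decision uses only column metadata, it gives the same answer for a file with 0 rows or with nulls in the first row.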
   
   ## Documentation
   
   ## Testing
   I tested it on 160 real-world SAS files and 90 synthetic SAS files. In all cases the result was OK.
   




> SAS reader fails to read the majority of sas files
> --------------------------------------------------
>
>                 Key: DRILL-8146
>                 URL: https://issues.apache.org/jira/browse/DRILL-8146
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>            Reporter: pseudomo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.20.0
>
>
> The SAS reader fails to read the majority of real-world SAS files.
> The reader throws NPEs if:
>  * The SAS file has 0 rows
>  * A Date column value is null
>  * The type of value is Number
>  * Long and Double values are mixed in one column (for some reason, the parso library converts any number whose fractional part is zero to Long)
> Schema inference issue:
>  * All Date values are converted to LocalDate, but SAS also supports DateTime (timestamps), so the time part is dropped
>  


