Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2017/09/14 22:38:00 UTC

[jira] [Comment Edited] (TIKA-2462) Add a parser for sas7bdat

    [ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167063#comment-16167063 ] 

Nick Burch edited comment on TIKA-2462 at 9/14/17 10:37 PM:
------------------------------------------------------------

I've just had a quick try with the library, against a test SAS file with 5 columns each of different types. Looking at the properties on the file, and on the columns, Parso is able to return:
{code}
u64 - false
compressionMethod - null
endianness - 1
encoding - windows-1252
sessionEncoding - null
name - SHEET1
fileType - DATA
dateCreated - Fri Mar 06 19:10:19 GMT 2015
dateModified - Fri Mar 06 19:10:19 GMT 2015
sasRelease - 9.0101M3
serverType - XP_PRO
osName - 
osType - 
headerLength - 1024
pageLength - 8192
pageCount - 1
rowLength - 96
rowCount - 31
mixPageRowCount - 69
columnsCount - 5

5 Columns defined:
 1 - A
  Label: A
  Format: $
  Size 58 of java.lang.String
 2 - B
  Label: B
  Format: 
  Size 8 of java.lang.Number
 3 - C
  Label: C
  Format: DATE
  Size 8 of java.lang.Number
 4 - D
  Label: D
  Format: DATETIME
  Size 8 of java.lang.Number
 5 - E
  Label: E
  Format: 
  Size 8 of java.lang.Number
{code}

I guess we'd want to map some of the file properties onto standard metadata keys, and the rest onto custom ones? For the data, I guess we'd output SAX events for an HTML-like table. I'm not sure about the column metadata, though: are there any patterns we can copy from the database formats or other scientific dataset formats?
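
As a very rough sketch of that idea (not a proposed implementation), something along these lines could work. It uses only the JDK's SAX classes so it runs on its own. `SasTableSketch`, `mapProperties`, `emitTable`, and the `sas:` key prefix are made-up names for illustration, and picking `dc:title`, `dcterms:created`, and `dcterms:modified` as the standard keys is an assumption, not a decided mapping. Real code would read the values from Parso and go through Tika's `Metadata` and `XHTMLContentHandler` instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Sketch only: all names here are hypothetical, not Tika or Parso API. */
final class SasTableSketch {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Renames a few properties to standard-style keys; the rest get a custom "sas:" prefix. */
    static Map<String, String> mapProperties(Map<String, String> parsoProps) {
        // Assumed mapping from Parso property names to Dublin Core style keys.
        Map<String, String> standard = new LinkedHashMap<>();
        standard.put("name", "dc:title");
        standard.put("dateCreated", "dcterms:created");
        standard.put("dateModified", "dcterms:modified");

        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : parsoProps.entrySet()) {
            String mapped = standard.get(e.getKey());
            out.put(mapped != null ? mapped : "sas:" + e.getKey(), e.getValue());
        }
        return out;
    }

    /** Emits one header row of th cells, then one tr of td cells per data row. */
    static void emitTable(ContentHandler h, List<String> columns, List<List<Object>> rows)
            throws SAXException {
        start(h, "table");
        start(h, "tr");
        for (String column : columns) {
            cell(h, "th", column);
        }
        end(h, "tr");
        for (List<Object> row : rows) {
            start(h, "tr");
            for (Object value : row) {
                // Null cells become empty strings in this sketch.
                cell(h, "td", value == null ? "" : value.toString());
            }
            end(h, "tr");
        }
        end(h, "table");
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    private static void cell(ContentHandler h, String name, String text) throws SAXException {
        start(h, name);
        char[] chars = text.toCharArray();
        h.characters(chars, 0, chars.length);
        end(h, name);
    }
}
```

Under this sketch, `rowCount` would end up as `sas:rowCount` and `name` as `dc:title`. The column labels and formats aren't handled here at all, which is the open question above.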

Also, we only seem to have one fairly simple test sas7bdat file in the Tika Parsers test documents area. Do we have a standard "moderately complicated" tabular test file (e.g. XLS or CSV) that I could get a SAS version made of, so that we have largely the same test data across formats?


> Add a parser for sas7bdat
> -------------------------
>
>                 Key: TIKA-2462
>                 URL: https://issues.apache.org/jira/browse/TIKA-2462
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> EPAM recently agreed to migrate to Apache 2.0 so that we can incorporate parso into Tika for sas7bdat files: https://github.com/epam/parso/issues/19 !!!


