Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2017/09/14 22:38:00 UTC
[jira] [Comment Edited] (TIKA-2462) Add a parser for sas7bdat
[ https://issues.apache.org/jira/browse/TIKA-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167063#comment-16167063 ]
Nick Burch edited comment on TIKA-2462 at 9/14/17 10:37 PM:
------------------------------------------------------------
I've just had a quick try with the library against a test SAS file with 5 columns of different types. Looking at the properties on the file and on the columns, Parso is able to return:
{code}
u64 - false
compressionMethod - null
endianness - 1
encoding - windows-1252
sessionEncoding - null
name - SHEET1
fileType - DATA
dateCreated - Fri Mar 06 19:10:19 GMT 2015
dateModified - Fri Mar 06 19:10:19 GMT 2015
sasRelease - 9.0101M3
serverType - XP_PRO
osName -
osType -
headerLength - 1024
pageLength - 8192
pageCount - 1
rowLength - 96
rowCount - 31
mixPageRowCount - 69
columnsCount - 5
5 Columns defined:
1 - A
Label: A
Format: $
Size 58 of java.lang.String
2 - B
Label: B
Format:
Size 8 of java.lang.Number
3 - C
Label: C
Format: DATE
Size 8 of java.lang.Number
4 - D
Label: D
Format: DATETIME
Size 8 of java.lang.Number
5 - E
Label: E
Format:
Size 8 of java.lang.Number
{code}
I guess we'd want to map some of the file properties onto standard metadata keys, and the rest onto custom ones? For the data, I guess we'd output SAX events for an HTML-like table. I'm not sure about the column metadata: are there any patterns we can copy from the database formats or other scientific dataset formats?
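To make the HTML-like table idea concrete, here is a rough sketch of the SAX events I have in mind. It only uses the JDK's own SAX and TrAX classes so it can be run standalone; inside Tika we would presumably drive an XHTMLContentHandler instead. The class name SasTableSketch, its methods and the sample values are all made up for illustration, and the rows are passed in as plain strings rather than read through Parso.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: none of these names come from Tika or Parso.
class SasTableSketch {

    private static final AttributesImpl NO_ATTRS = new AttributesImpl();

    // Emits one <table>: a header row of <th> cells built from the column
    // names, then one <tr> of <td> cells for each data row.
    static void writeTable(ContentHandler handler, List<String> columnNames,
                           List<List<String>> rows) throws SAXException {
        handler.startElement("", "table", "table", NO_ATTRS);
        handler.startElement("", "tr", "tr", NO_ATTRS);
        for (String name : columnNames) {
            cell(handler, "th", name);
        }
        handler.endElement("", "tr", "tr");
        for (List<String> row : rows) {
            handler.startElement("", "tr", "tr", NO_ATTRS);
            for (String value : row) {
                cell(handler, "td", value);
            }
            handler.endElement("", "tr", "tr");
        }
        handler.endElement("", "table", "table");
    }

    // Emits a single cell element; a null value becomes an empty cell.
    private static void cell(ContentHandler handler, String tag, String value)
            throws SAXException {
        char[] text = (value == null ? "" : value).toCharArray();
        handler.startElement("", tag, tag, NO_ATTRS);
        handler.characters(text, 0, text.length);
        handler.endElement("", tag, tag);
    }

    // Serialises the SAX events to a string using the JDK identity
    // transformer, so the resulting markup can be inspected.
    static String toXml(List<String> columnNames, List<List<String>> rows)
            throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
               .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));
        handler.startDocument();
        writeTable(handler, columnNames, rows);
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample values, not taken from the test file.
        System.out.println(toXml(List.of("A", "B"),
                List.of(List.of("first", "1"), List.of("second", "2"))));
    }
}
```

Keeping writeTable dependent only on ContentHandler means the same calls should work against whatever handler the parser is given, which is the part that matters for the design question; where the column labels and formats should go is still open.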
Also, we only seem to have one fairly simple test sas7bdat file in the Tika Parsers test documents area. Do we have a standard "moderately complicated" tabular test file (e.g. XLS, CSV) that I could get a SAS version made of, so we can have largely the same test data across formats?
> Add a parser for sas7bdat
> -------------------------
>
> Key: TIKA-2462
> URL: https://issues.apache.org/jira/browse/TIKA-2462
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> EPAM recently agreed to migrate to Apache 2.0 so that we can incorporate parso into Tika for sas7bdat files: https://github.com/epam/parso/issues/19 !!!
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)