Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2018/05/03 20:42:00 UTC

[jira] [Created] (TIKA-2641) Unit test for consistency between tabular/columnar formats

Nick Burch created TIKA-2641:
--------------------------------

             Summary: Unit test for consistency between tabular/columnar formats
                 Key: TIKA-2641
                 URL: https://issues.apache.org/jira/browse/TIKA-2641
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.18, 2.0
            Reporter: Nick Burch


We now have a number of parsers that handle file formats which are either wholly or optionally "table-based", with consistent data types within a given column. This includes multi-table formats like sqlite, single-table formats like sas7bdat, and anything-goes-table formats like csv or xlsx.

We should first try to create a simple-ish, small but rich file for each of these formats, similar to what we do for archive formats with the {{test-documents}} archives. Then, we should add unit tests that verify that, as far as the formats permit, you get basically the same XHTML out for the "same" input. Oh, and fix up any obvious inconsistencies...
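
As a rough illustration of what such a check might look like, here is a minimal sketch that reduces XHTML output to a list of rows of cell text and compares two such lists. It uses only the JDK's XML parser; the class name {{TableConsistencySketch}} and the methods {{extractRows}} and {{sameTable}} are hypothetical and not part of Tika. It assumes the XHTML string has already been obtained from each parser (for example via a {{ToXMLContentHandler}}) and that it is well-formed XML without a DOCTYPE.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: compares the tabular content of two XHTML strings.
final class TableConsistencySketch {

    // Reduce an XHTML string to its rows, each row being the trimmed text of its td/th cells.
    static List<List<String>> extractRows(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // so local names work with or without the XHTML namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagNameNS("*", "tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes between cells
                }
                String name = child.getLocalName();
                if ("td".equals(name) || "th".equals(name)) {
                    cells.add(child.getTextContent().trim());
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    // Two outputs are treated as "the same" if their rows of cell text are equal.
    static boolean sameTable(String xhtmlA, String xhtmlB) throws Exception {
        return extractRows(xhtmlA).equals(extractRows(xhtmlB));
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for the XHTML produced from, say, a csv and an xlsx test file.
        String fromCsv = "<table><tr><th>id</th><th>name</th></tr>"
                + "<tr><td>1</td><td>alice</td></tr></table>";
        String fromXlsx = "<div xmlns=\"http://www.w3.org/1999/xhtml\"><table><tbody>"
                + "<tr><th>id</th><th>name</th></tr>"
                + "<tr><td> 1 </td><td>alice</td></tr></tbody></table></div>";
        System.out.println(sameTable(fromCsv, fromXlsx));
    }
}
```

Comparing only cell text deliberately ignores wrapper elements, attributes and whitespace, which is one possible reading of "as much as formats permit"; a stricter test could also compare column headers or sheet/table names where the format has them.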



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)