Posted to dev@tika.apache.org by "Zoltan Toth (JIRA)" <ji...@apache.org> on 2015/12/22 01:11:46 UTC

[jira] [Created] (TIKA-1817) Extracts entire file content for ASCII DXF files

Zoltan Toth created TIKA-1817:
---------------------------------

             Summary: Extracts entire file content for ASCII DXF files
                 Key: TIKA-1817
                 URL: https://issues.apache.org/jira/browse/TIKA-1817
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.11
            Reporter: Zoltan Toth


By definition, ASCII DXF files are encoded in plain text.  However, the vast majority of their content is not intended to be human-readable (see https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately, for these files Tika simply "extracts" the entire content of the file instead of only the human-readable portions (e.g. comments and text annotations) that a CAD tool would render.  This results in massive amounts of rubbish data being returned, with dire consequences for applications that rely on the extracted text.

It would be nice if only the human-readable text fields were extracted.  Failing this, it would still be preferable if no text were extracted from these files at all.
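
To illustrate what "only the human-readable text fields" could mean, here is a minimal sketch, not a proposed Tika patch.  It is written in Python for brevity and assumes the usual ASCII DXF layout of alternating group-code and value lines, where group code 0 starts a new entity, group code 999 carries a comment, and group code 1 carries the primary text value of an entity.  The function name extract_dxf_text and the set TEXT_ENTITIES are hypothetical, and the list of entity types is illustrative rather than exhaustive.

```python
# Hypothetical sketch: pull only displayable text out of an ASCII DXF string.
# Assumption: entity types listed here are illustrative, not a complete list.
TEXT_ENTITIES = {"TEXT", "MTEXT", "ATTRIB", "ATTDEF"}


def extract_dxf_text(dxf: str) -> list[str]:
    """Return comments (group code 999) and text values (group code 1)
    that belong to text-bearing entities, in file order."""
    lines = dxf.splitlines()
    found = []
    current_entity = None

    # ASCII DXF is read as pairs: a group-code line followed by a value line.
    for i in range(0, len(lines) - 1, 2):
        code = lines[i].strip()
        value = lines[i + 1].strip()

        if code == "0":
            # Group code 0 starts a new entity (or section marker).
            current_entity = value
        elif code == "999":
            # Group code 999 is a comment.
            found.append(value)
        elif code == "1" and current_entity in TEXT_ENTITIES:
            # Group code 1 is the primary text value of the current entity.
            found.append(value)

    return found
```

With this approach, a group code 1 value in the HEADER section (for example the $ACADVER version string) is skipped because the enclosing item is a SECTION rather than a text entity.  The sketch does not strip MTEXT inline formatting codes or handle binary DXF; a real parser would need to deal with both.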



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)