Posted to dev@tika.apache.org by "Zoltan Toth (JIRA)" <ji...@apache.org> on 2015/12/22 02:24:46 UTC

[jira] [Updated] (TIKA-1817) Extracts entire file content for ASCII DXF files

     [ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Toth updated TIKA-1817:
------------------------------
    Attachment: jcsample-screendump.jpg
                jcsample.dxf

I have attached a sample DXF found on the web, as well as a screen dump taken of the file being rendered by a freeware viewer called DWGSee (http://www.autodwg.com/dwg-viewer/).

Hopefully, these will give a good enough idea of what human-readable text is available, i.e. it should be a simple matter of searching the sample DXF file for the rendered text content to discover how it is encoded.
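
As a rough illustration of what such a search would be looking for, here is a minimal sketch in Java. It assumes the usual ASCII DXF layout of alternating lines (a numeric group code, then its value), with group code 0 starting a new entity and group code 1 holding the primary text string of TEXT and MTEXT entities. The class and method names (DxfTextSketch, extractText) are hypothetical and not part of Tika. The sketch ignores MTEXT continuation chunks (group code 3), inline formatting codes, and other text-bearing entities such as attributes, so treat it as a starting point rather than a parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Tika code: pulls the text strings out of
// TEXT and MTEXT entities in an ASCII DXF stream.
public class DxfTextSketch {

    /** Collects the group code 1 values of TEXT and MTEXT entities. */
    public static List<String> extractText(BufferedReader in) throws IOException {
        List<String> result = new ArrayList<>();
        String currentEntity = "";
        String codeLine;
        // ASCII DXF is read as pairs of lines: group code, then value.
        while ((codeLine = in.readLine()) != null) {
            String value = in.readLine();
            if (value == null) {
                break; // truncated final pair
            }
            int code;
            try {
                code = Integer.parseInt(codeLine.trim());
            } catch (NumberFormatException e) {
                continue; // assumption: skip pairs whose code is not numeric
            }
            if (code == 0) {
                // Group code 0 names the entity (or section marker) that follows.
                currentEntity = value.trim();
            } else if (code == 1
                    && (currentEntity.equals("TEXT") || currentEntity.equals("MTEXT"))) {
                // Group code 1 carries the text string of the current entity.
                result.add(value.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Tiny hand-written sample: one TEXT entity followed by one LINE entity.
        String sample = String.join("\n",
                "0", "SECTION", "2", "ENTITIES",
                "0", "TEXT", "8", "0", "10", "1.0", "1", "Hello DXF",
                "0", "LINE", "8", "0",
                "0", "ENDSEC", "0", "EOF");
        System.out.println(extractText(new BufferedReader(new StringReader(sample))));
    }
}
```

Running something like this against jcsample.dxf and comparing the result with the screen dump should show quickly whether the visible text lives in those entities or elsewhere in the file.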

Apologies, I should have done this at the beginning :-)

> Extracts entire file content for ASCII DXF files
> ------------------------------------------------
>
>                 Key: TIKA-1817
>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Zoltan Toth
>         Attachments: jcsample-screendump.jpg, jcsample.dxf
>
>
> By definition, ASCII DXF files are encoded in plain text.  However, the vast majority of their content is not intended to be human readable (see https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately, for these files Tika simply "extracts" the entire content of the file instead of only the human-readable portions (e.g. comments) that a CAD tool would render.  This results in massive amounts of rubbish data being returned, with dire consequences for applications that rely on the extracted text.
> It would be nice if only the human-readable text fields were extracted.  Failing this, it would still be nice if no text were extracted from these files at all.


