Posted to dev@tika.apache.org by "Ziqi (JIRA)" <ji...@apache.org> on 2015/10/14 12:50:05 UTC

[jira] [Updated] (TIKA-1770) AutoDetectParser wrongly detects plain text as images/audio

     [ https://issues.apache.org/jira/browse/TIKA-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ziqi updated TIKA-1770:
-----------------------
    Attachment: the-acl-rd-tec_chunk_10228.txt
                the-acl-rd-tec_chunk_9113.txt
                the-acl-rd-tec_chunk_15.txt

> AutoDetectParser wrongly detects plain text as images/audio
> -----------------------------------------------------------
>
>                 Key: TIKA-1770
>                 URL: https://issues.apache.org/jira/browse/TIKA-1770
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.10
>         Environment: OS independent (tested on both Windows and Mac OS)
>            Reporter: Ziqi
>            Priority: Minor
>         Attachments: the-acl-rd-tec_chunk_10228.txt, the-acl-rd-tec_chunk_15.txt, the-acl-rd-tec_chunk_9113.txt
>
>
> AutoDetectParser fails to recognize certain plain-text files as plain text.
> Three test files are attached; as you can see, they are all plain text.
> The following code is used for testing:
> ————————
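> import java.io.BufferedInputStream;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
>
> AutoDetectParser parser = new AutoDetectParser();
> for (File f : new File("path").listFiles()) {
>     // try-with-resources closes the stream even if parsing fails
>     try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
>         BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
>         Metadata metadata = new Metadata();
>         parser.parse(in, handler, metadata);
>         String content = handler.toString();
>         System.out.println(metadata); //line A
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }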
> ————————
> For the three test files, line A prints the following:
> X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap 
> X-Parsed-By=org.apache.tika.parser.DefaultParser X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3 Content-Type=audio/mpeg 
> X-Parsed-By=org.apache.tika.parser.EmptyParser Content-Type=image/x-portable-bitmap 
> As a result, the variable "content" is always empty.
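
The report does not say why detection goes wrong. One plausible explanation, offered purely as an assumption, is that a detector which matches on leading "magic" bytes can be fooled by ordinary text that happens to start with the same characters. The sketch below is a deliberately simplified, hypothetical detector (the class MagicSniffDemo and its sniff method are made up for illustration and are not Tika code). It uses the well-known PBM magic numbers "P1"/"P4" and the "ID3" tag prefix commonly found at the start of MP3 files to show how such a collision could arise.

```java
import java.nio.charset.StandardCharsets;

public class MagicSniffDemo {

    // Hypothetical, simplified detector that looks ONLY at the leading bytes.
    // This is an illustration of prefix matching, not Tika's detection logic.
    static String sniff(byte[] head) {
        if (startsWith(head, "P1") || startsWith(head, "P4")) {
            return "image/x-portable-bitmap"; // Netpbm PBM magic numbers
        }
        if (startsWith(head, "ID3")) {
            return "audio/mpeg"; // ID3v2 tag that commonly prefixes MP3 data
        }
        return "text/plain"; // fallback when no magic prefix matches
    }

    // Returns true if data begins with the ASCII bytes of magic.
    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample sentences, not the contents of the attached files.
        String[] samples = {
            "P1 and P2 denote the two systems compared here.",
            "ID3 is a classic decision-tree learning algorithm.",
            "This sentence has no accidental magic prefix."
        };
        for (String s : samples) {
            byte[] head = s.getBytes(StandardCharsets.US_ASCII);
            System.out.println(sniff(head) + " <- " + s);
        }
    }
}
```

If something along these lines is the cause, inspecting the first few bytes of each attachment should make it visible; that check has not been done here.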


