Posted to dev@tika.apache.org by aliosha79 <al...@yahoo.it> on 2014/05/16 15:12:13 UTC

parser metadata empty after tika detect

I'm facing a problem with Tika parsing.
In my use case I have to parse different file types, each with the right
parser, including an .eml file.
As input my app can receive any kind of file. In particular, I have a
MyEmail.eml file whose content type is detected as text/html. I want to get
all of the file's available metadata.
With AutoDetectParser, MyEmail.eml is detected as text/html, which is not
good enough, so I have to use RFC822Parser explicitly to get the
Message-From and Message-To metadata.
For this purpose I have written these few lines of code:

       File f = new File("MyEmail.eml");
       is = new FileInputStream(f);

       Tika tika = new Tika();
       String mimeType = tika.detect(is);

       if (FileUtils.getExtension("MyEmail.eml").equalsIgnoreCase("eml")) {
           if (mimeType.equalsIgnoreCase("text/html")) {
               parser = new RFC822Parser();
           } else {
               parser = new AutoDetectParser();
           }
       } else {
           parser = new AutoDetectParser();
       }

       parser.parse(is, ch, metadata, new ParseContext());
       for (int i = 0; i < metadata.names().length; i++) {
           String item = metadata.names()[i];
           System.out.println(item + " -- " + metadata.get(item));
       }

In this case the only metadata printed is content-type =
application/octet-stream.
If I comment out tika.detect(is), the output shows all the metadata
I need.
If I open a second input stream on the same file and run detection on that
one instead:

       is2= new FileInputStream(f);
       Tika tika = new Tika();
       String mimeType = tika.detect(is2);

the output also shows all the metadata I need.
What does tika.detect(InputStream) do to the stream?
Thanks a lot





Re: parser metadata empty after tika detect

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 16 May 2014, aliosha79 wrote:
> For this purpose I have written these few lines of code:
>
>       File f = new File("MyEmail.eml");
>       is= new FileInputStream(f);
>
>       Tika tika = new Tika();
>       String mimeType = tika.detect(is);

This will most likely consume a fair bit of the input stream, possibly all
of it, so the parser is left with little or nothing to read. You'd be much
better off initialising a TikaInputStream from the File object directly.
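
To illustrate the effect, here is a minimal sketch using only java.io, with
no Tika calls. The class and method names are made up for this example, and
peek() is only a stand-in for detection: it assumes that a stream without
mark support gets wrapped in a buffered stream before being read, so the
rewind happens on the wrapper and not on the stream the caller still holds.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo class: mimics "peek at the stream, then parse it". */
class StreamPeekDemo {

    /**
     * Stand-in for a detector (an assumption, not Tika's code): reads up to
     * n leading bytes, then rewinds the stream it actually read from.
     * A stream without mark support is wrapped first, so in that case the
     * rewind applies to the wrapper only.
     */
    static String peek(InputStream in, int n) throws IOException {
        InputStream s = in.markSupported() ? in : new BufferedInputStream(in);
        s.mark(n);
        byte[] head = new byte[n];
        int len = Math.max(0, s.read(head, 0, n));
        s.reset();
        return new String(head, 0, len, StandardCharsets.US_ASCII);
    }

    /** Stand-in for the parser: reads whatever is still left in the stream. */
    static String readRest(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.US_ASCII);
    }

    /** Peeks on a raw FileInputStream, then reads from that same stream. */
    static String detectThenReadRaw(Path file) throws IOException {
        try (InputStream raw = new FileInputStream(file.toFile())) {
            peek(raw, 4);
            return readRest(raw);
        }
    }

    /** Wraps once in a mark-capable stream and uses it for both steps. */
    static String detectThenReadBuffered(Path file) throws IOException {
        try (InputStream in =
                 new BufferedInputStream(new FileInputStream(file.toFile()))) {
            peek(in, 4);
            return readRest(in);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".eml");
        try {
            Files.write(file, "From: a@example.org\r\nTo: b@example.org\r\n"
                    .getBytes(StandardCharsets.US_ASCII));
            System.out.println("raw:      [" + detectThenReadRaw(file) + "]");
            System.out.println("buffered: [" + detectThenReadBuffered(file) + "]");
        } finally {
            Files.delete(file);
        }
    }
}
```

For a file smaller than the wrapper's buffer, the raw variant should leave
nothing for the second read, while the buffered variant still sees the whole
message. Handing the same mark-capable stream to both detection and parsing
is the general idea; a TikaInputStream built from the File is the Tika way
to get such a stream.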

> As input my app can receive any kind of file. In particular, I have a
> MyEmail.eml file whose content type is detected as text/html

I'd suggest you raise a bug and attach a small file that isn't detected
properly. We can then look at whether we can improve the detection.

Nick