Posted to dev@tika.apache.org by aliosha79 <al...@yahoo.it> on 2014/05/16 15:12:13 UTC
parser metadata empty after tika detect
I'm facing a problem with Tika parsing.
In my use case I have to parse different file types using the right parser,
including an .eml file.
My app can receive any kind of file as input. In particular, I have a
MyEmail.eml file whose content type is detected as text/html. My aim is to get
all of the file's available metadata.
With AutoDetectParser, MyEmail.eml is treated as text/html, which is not
good enough, so I have to use RFC822Parser explicitly to get the
Message-From and Message-To metadata.
For this purpose I have written these few lines of code:
File f = new File("MyEmail.eml");
is = new FileInputStream(f);
Tika tika = new Tika();
String mimeType = tika.detect(is);
if (FileUtils.getExtension("MyEmail.eml").equalsIgnoreCase("eml")) {
    if (mimeType.equalsIgnoreCase("text/html"))
        parser = new RFC822Parser();
    else
        parser = new AutoDetectParser();
} else {
    parser = new AutoDetectParser();
}
parser.parse(is, ch, metadata, new ParseContext());
for (int i = 0; i < metadata.names().length; i++) {
    String item = metadata.names()[i];
    System.out.println(item + " -- " + metadata.get(item));
}
In this case the only metadata printed is
Content-Type = application/octet-stream.
If I comment out tika.detect(is), the output contains all the metadata
I need.
If I open a second input stream on the same file and run detection on that
one instead:
is2 = new FileInputStream(f);
Tika tika = new Tika();
String mimeType = tika.detect(is2);
then parsing the first stream prints all the metadata I need.
What does tika.detect(InputStream) do to the stream it is given?
Thanks a lot
Re: parser metadata empty after tika detect
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 16 May 2014, aliosha79 wrote:
> For this purpose I have written these few lines of code:
>
> File f = new File("MyEmail.eml");
> is = new FileInputStream(f);
>
> Tika tika = new Tika();
> String mimeType = tika.detect(is);
This will most likely consume a fair bit of the input stream, possibly all
of it, so the parser is then handed a stream with little or nothing left to
read. You'd be much better off initialising a TikaInputStream from
the File object directly.
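To illustrate the effect, here is a minimal JDK-only sketch. It does not call Tika at all: sniff() is a hypothetical stand-in for a detection step that reads the first few bytes, and sniffThenRead() is a made-up helper for the demo. The assumption being illustrated is that a plain FileInputStream does not support mark/reset, so whatever detection reads is gone by the time the parser runs, whereas a stream that does support mark/reset (here a BufferedInputStream) can be rewound.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class StreamConsumptionDemo {

    // Hypothetical stand-in for type detection (NOT Tika's API): reads up to
    // `limit` bytes, then rewinds only if the stream supports mark/reset.
    static void sniff(InputStream in, int limit) throws IOException {
        boolean resettable = in.markSupported();
        if (resettable) {
            in.mark(limit);
        }
        in.readNBytes(new byte[limit], 0, limit);
        if (resettable) {
            in.reset();
        }
    }

    // Writes `content` to a temp file, sniffs `limit` bytes, and returns what
    // a later reader (the "parser") would still see on the same stream.
    static String sniffThenRead(String content, boolean buffered, int limit) {
        try {
            File f = File.createTempFile("demo", ".eml");
            f.deleteOnExit();
            Files.write(f.toPath(), content.getBytes(StandardCharsets.UTF_8));
            try (InputStream raw = new FileInputStream(f);
                 InputStream in = buffered ? new BufferedInputStream(raw) : raw) {
                sniff(in, limit);
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String mail = "From: a@example.com\n\nhello";
        // Plain FileInputStream: the sniffed bytes are missing afterwards.
        System.out.println("[raw]      " + sniffThenRead(mail, false, 6));
        // BufferedInputStream: mark/reset restores the full content.
        System.out.println("[buffered] " + sniffThenRead(mail, true, 6));
    }
}
```

In the original code the same idea would mean handing both the detect and the parse call a stream that can be rewound, for example one obtained from TikaInputStream.get(f) as suggested above, or simply opening a fresh stream for parsing.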
> My app can receive any kind of file as input. In particular, I have a
> MyEmail.eml file whose content type is detected as text/html
I'd suggest you raise a bug and attach a small file that isn't detected
properly. We can then look at whether we can improve the detection.
Nick