Posted to dev@tika.apache.org by "liyu (Jira)" <ji...@apache.org> on 2021/10/14 06:43:00 UTC

[jira] [Created] (TIKA-3574) after the fork parser timeout,Can't get the correct content-type

liyu created TIKA-3574:
--------------------------

             Summary: after the fork parser timeout,Can't get the correct content-type
                 Key: TIKA-3574
                 URL: https://issues.apache.org/jira/browse/TIKA-3574
             Project: Tika
          Issue Type: Bug
            Reporter: liyu


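Code example:
{code:java}
import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

// Wrap the auto-detecting parser so embedded documents are parsed recursively
Parser parser = new AutoDetectParser(tikaConfig);
parser = new RecursiveParserWrapper(parser);

ForkParser forkParser = new ForkParser(parser.getClass().getClassLoader(), parser);
forkParser.setServerParseTimeoutMillis(600000);
forkParser.setServerWaitTimeoutMillis(600000);

// Then parse the input stream
BasicContentHandlerFactory factory =
    new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, 104857600);
RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(factory, -1);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
try {
  forkParser.parse(inputStream, handler, metadata, context);
} catch (Exception e) {
  // Swallowed (including a timeout) so the handler can be inspected afterwards
}
{code}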
After the fork parser times out, I get the metadata list from handler.getMetadataList().

But handler.getMetadataList().get(0) is not the root metadata of the input stream; it is the metadata of an embedded document inside the input stream.

So I can't get the correct Content-Type for the input stream.
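
A possible caller-side guard, as a minimal sketch only: instead of assuming that index 0 is the root, look for the first entry that is not marked as an embedded document. Plain Java maps stand in for Tika Metadata objects so the example is self-contained. The class name RootMetadataFinder and the method findRootIndex are hypothetical, and the key "X-TIKA:embedded_resource_path" is an assumption about how embedded entries are marked. This report does not confirm that a root entry is present at all after a timeout, which is why the sketch can return -1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; plain maps stand in for Tika Metadata objects.
class RootMetadataFinder {
    // Assumption: embedded documents carry this key and the root does not.
    static final String EMBEDDED_PATH_KEY = "X-TIKA:embedded_resource_path";

    // Returns the index of the first entry without the embedded-path key,
    // or -1 if every entry looks like an embedded document.
    static int findRootIndex(List<Map<String, String>> metadataList) {
        for (int i = 0; i < metadataList.size(); i++) {
            if (!metadataList.get(i).containsKey(EMBEDDED_PATH_KEY)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Simulated list where an embedded document comes first.
        Map<String, String> embedded = new HashMap<>();
        embedded.put(EMBEDDED_PATH_KEY, "/image1.png");
        embedded.put("Content-Type", "image/png");

        Map<String, String> root = new HashMap<>();
        root.put("Content-Type", "application/pdf");

        List<Map<String, String>> list = new ArrayList<>();
        list.add(embedded);
        list.add(root);

        int idx = findRootIndex(list);
        System.out.println(idx < 0 ? "no root metadata found"
                                   : list.get(idx).get("Content-Type"));
    }
}
```

If the helper returns -1, the caller knows that only embedded-document metadata was collected and should not trust any Content-Type in the list as that of the input stream.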

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)