Posted to dev@tika.apache.org by "liyu (Jira)" <ji...@apache.org> on 2021/10/14 06:45:00 UTC

[jira] [Updated] (TIKA-3574) after the fork parser timeout, can't get the correct content-type

     [ https://issues.apache.org/jira/browse/TIKA-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyu updated TIKA-3574:
-----------------------
    Description: 
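Code example:
{code:java}
import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.RecursiveParserWrapperHandler;

Parser parser = new AutoDetectParser(tikaConfig);
parser = new RecursiveParserWrapper(parser);

ForkParser forkParser = new ForkParser(parser.getClass().getClassLoader(), parser);
forkParser.setServerParseTimeoutMillis(600000);
forkParser.setServerWaitTimeoutMillis(600000);

// Then parse the input stream
BasicContentHandlerFactory factory = new BasicContentHandlerFactory(
    BasicContentHandlerFactory.HANDLER_TYPE.HTML, 104857600);
RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(factory, -1);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
try {
  forkParser.parse(inputStream, handler, metadata, context);
} catch (Exception e) {
  // Any exception (including a timeout) is ignored in this example
}
{code}
After the fork parser times out, I get the metadata list from handler.getMetadataList().

However, handler.getMetadataList().get(0) is not the root metadata of the input stream; it is the metadata of an embedded document inside the input stream.

So I can't get the correct Content-Type for the input stream.

Tika version: Apache Tika 1.25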
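
A possible way to guard against this on the caller side is sketched below. It is only an illustration, not Tika code: each metadata entry is modeled as a plain Map, and the class name RootMetadataSketch, its methods, and the fallback value are hypothetical. It also assumes that embedded entries carry the "X-TIKA:embedded_resource_path" key and that the root entry does not, which should be checked against the Tika version in use. The idea is to look for the root entry explicitly instead of trusting index 0, and to fall back to a caller-supplied value when no root entry is present.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class RootMetadataSketch {
    // Assumption: this key is present on embedded documents only.
    static final String EMBEDDED_PATH_KEY = "X-TIKA:embedded_resource_path";
    static final String CONTENT_TYPE_KEY = "Content-Type";

    // Returns the first entry without an embedded resource path, if any.
    static Optional<Map<String, String>> findRoot(List<Map<String, String>> metadataList) {
        for (Map<String, String> entry : metadataList) {
            if (!entry.containsKey(EMBEDDED_PATH_KEY)) {
                return Optional.of(entry);
            }
        }
        return Optional.empty();
    }

    // Returns the root Content-Type, or the fallback when no root entry
    // (or no Content-Type on it) is available, e.g. after a timeout.
    static String rootContentType(List<Map<String, String>> metadataList, String fallback) {
        return findRoot(metadataList)
                .map(entry -> entry.get(CONTENT_TYPE_KEY))
                .orElse(fallback);
    }

    public static void main(String[] args) {
        // Simulates a list that holds only an embedded document's metadata.
        List<Map<String, String>> entries = new ArrayList<>();
        Map<String, String> embedded = new HashMap<>();
        embedded.put(EMBEDDED_PATH_KEY, "/image1.png");
        embedded.put(CONTENT_TYPE_KEY, "image/png");
        entries.add(embedded);

        System.out.println(rootContentType(entries, "unknown"));
    }
}
```

With real Tika objects the same check would be made against each Metadata entry in the list; the fallback could come from a separate content-type detection step run before the fork parse.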

 


> after the fork parser timeout, can't get the correct content-type
> -----------------------------------------------------------------
>
>                 Key: TIKA-3574
>                 URL: https://issues.apache.org/jira/browse/TIKA-3574
>             Project: Tika
>          Issue Type: Bug
>            Reporter: liyu
>            Priority: Major
>
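> Code example:
> {code:java}
> import org.apache.tika.fork.ForkParser;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.RecursiveParserWrapper;
> import org.apache.tika.sax.BasicContentHandlerFactory;
> import org.apache.tika.sax.RecursiveParserWrapperHandler;
>
> Parser parser = new AutoDetectParser(tikaConfig);
> parser = new RecursiveParserWrapper(parser);
> ForkParser forkParser = new ForkParser(parser.getClass().getClassLoader(), parser);
> forkParser.setServerParseTimeoutMillis(600000);
> forkParser.setServerWaitTimeoutMillis(600000);
> // Then parse the input stream
> BasicContentHandlerFactory factory = new BasicContentHandlerFactory(
>     BasicContentHandlerFactory.HANDLER_TYPE.HTML, 104857600);
> RecursiveParserWrapperHandler handler = new RecursiveParserWrapperHandler(factory, -1);
> Metadata metadata = new Metadata();
> ParseContext context = new ParseContext();
> try {
>   forkParser.parse(inputStream, handler, metadata, context);
> } catch (Exception e) {
>   // Any exception (including a timeout) is ignored in this example
> }
> {code}
> After the fork parser times out, I get the metadata list from handler.getMetadataList().
> However, handler.getMetadataList().get(0) is not the root metadata of the input stream; it is the metadata of an embedded document inside the input stream.
> So I can't get the correct Content-Type for the input stream.
>
> Tika version: Apache Tika 1.25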
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)