Posted to dev@tika.apache.org by "Vitaliy Filippov (JIRA)" <ji...@apache.org> on 2011/09/07 18:03:16 UTC

[jira] [Created] (TIKA-709) Tika network server does not print anything in response to, for example, Word documents

Tika network server does not print anything in response to, for example, Word documents
---------------------------------------------------------------------------------------

                 Key: TIKA-709
                 URL: https://issues.apache.org/jira/browse/TIKA-709
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 0.9
         Environment: Debian Linux Sid
            Reporter: Vitaliy Filippov


When trying to use Tika Server (java -jar tika-app-0.9.jar -t -p PORT) to parse MS Word DOC/DOCX files, the Tika server reads the file and then does nothing more; it simply hangs, probably blocked on a socket read. This does not happen with, for example, HTML documents. I don't know the mechanics of this bug, but the following change definitely fixes the issue:

Change
type.process(socket.getInputStream(), output);
to
type.process(new CloseShieldInputStream(socket.getInputStream()), output);
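To illustrate what the proposed wrapper does, here is a minimal sketch of close shielding. It does not use Tika's actual CloseShieldInputStream: ShieldedInputStream, TrackingInputStream and misbehavingParse are hypothetical stand-ins written for this example, and the real class may behave differently in detail. The sketch only shows that a close() call made by the consumer never reaches the wrapped stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CloseShieldSketch {

    /** Hypothetical stand-in for a close shield: close() is swallowed. */
    static class ShieldedInputStream extends FilterInputStream {
        ShieldedInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the wrapped stream stays open.
        }
    }

    /** Hypothetical stream that records whether close() reached it. */
    static class TrackingInputStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingInputStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Simulates a parser that closes the stream it was handed. */
    static void misbehavingParse(InputStream in) throws IOException {
        while (in.read() != -1) {
            // Consume the whole document.
        }
        in.close();
    }

    /** Returns whether the raw stream was closed when passed in directly. */
    static boolean closedWithoutShield() {
        try {
            TrackingInputStream raw =
                    new TrackingInputStream("doc".getBytes(StandardCharsets.UTF_8));
            misbehavingParse(raw);
            return raw.closed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns whether the raw stream was closed when passed in shielded. */
    static boolean closedWithShield() {
        try {
            TrackingInputStream raw =
                    new TrackingInputStream("doc".getBytes(StandardCharsets.UTF_8));
            misbehavingParse(new ShieldedInputStream(raw));
            return raw.closed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("closed without shield: " + closedWithoutShield());
        System.out.println("closed with shield:    " + closedWithShield());
    }
}
```

With the shield in place, the caller alone decides when the underlying stream is closed.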

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (TIKA-709) Tika network server does not print anything in response to, for example, Word documents

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-709.
--------------------------------

    Resolution: Fixed

It's better to file a new issue than reopen an old one whose fix has already been released. Thus re-resolving.

Re: CloseShieldInputStream; I don't see the need for that or for explicitly closing the streams in the finally block. The socket.close() call already takes care of releasing all resources, and there shouldn't be any need to explicitly protect the input stream from being closed. Please follow up on dev@ or in a separate issue if I'm missing something.
                
> Tika network server does not print anything in response to, for example, Word documents
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-709
>                 URL: https://issues.apache.org/jira/browse/TIKA-709
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.9
>         Environment: Debian Linux Sid
>            Reporter: Vitaliy Filippov
>            Assignee: Jukka Zitting
>             Fix For: 0.10
>
>         Attachments: tika-709.diff
>
>
> When trying to use Tika Server (java -jar tika-app-0.9.jar -t -p PORT) to parse MS Word DOC/DOCX files, the Tika server reads the file and then does nothing more; it simply hangs, probably blocked on a socket read. This does not happen with, for example, HTML documents. I don't know the mechanics of this bug, but the following change definitely fixes the issue:
> Change
> type.process(socket.getInputStream(), output);
> to
> type.process(new CloseShieldInputStream(socket.getInputStream()), output);


        

[jira] [Resolved] (TIKA-709) Tika network server does not print anything in response to, for example, Word documents

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-709.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.10
         Assignee: Jukka Zitting

Good catch, thanks! This problem was caused by some of our parser classes closing the input stream even though the Parser interface assumes that they won't. I fixed the parsers in revision 1173951 by making them use CloseShieldInputStream where appropriate.
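One plausible reason a parser closing its input matters in server mode is a documented JDK behaviour: closing the stream returned by Socket.getInputStream() closes the whole socket, so nothing can be written back to the client afterwards. The sketch below checks this over a loopback connection. The class and method names are made up for this example, and it is not taken from the Tika code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketCloseDemo {

    /**
     * Opens a loopback connection, closes the accepted socket's input
     * stream, and reports whether the socket itself is now closed.
     */
    static boolean socketClosedAfterInputClose() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            // This is what a misbehaving parser effectively does.
            accepted.getInputStream().close();
            return accepted.isClosed();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("socket closed after input close: "
                + socketClosedAfterInputClose());
    }
}
```

Shielding the stream, either in the parser or at the call site, keeps the socket usable until the server closes it explicitly.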



        

[jira] [Reopened] (TIKA-709) Tika network server does not print anything in response to, for example, Word documents

Posted by "Vitaliy Filippov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitaliy Filippov reopened TIKA-709:
-----------------------------------


I see you included the fix in some version, and I have one more comment: the fix was applied only to pipe mode, but it would be better to also use CloseShieldInputStream in server mode. I'll attach a patch with the next comment.
                


        

[jira] [Updated] (TIKA-709) Tika network server does not print anything in response to, for example, Word documents

Posted by "Vitaliy Filippov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitaliy Filippov updated TIKA-709:
----------------------------------

    Attachment: tika-709.diff

Patch for server mode
                
