Posted to user@tika.apache.org by Haris Osmanagic <ha...@gmail.com> on 2017/06/02 10:27:37 UTC

"Stream closed" error when extracting text using Tika Server

Hi everyone!

I am using Tika Server, and I have faced a weird thing when extracting text
and requiring a plain text response. Tests can be found here:
https://github.com/hariso/tika/commit/2a0dc37a4427070360c7ebe147712d9c873a4e7b

*Version used*: 1.15
*File used*: Any I tried (MS Word, DOCX, PDF)
*Method used*: Multipart upload, using Accept: text/plain

*Expected result*: extracted text
*Actual result*: extracted text PLUS an error saying

<ns1:XMLFault xmlns:ns1="http://cxf.apache.org/bindings/xformat"><ns1:faultstring
xmlns:ns1="http://cxf.apache.org/bindings/xformat">java.io.IOException:
Stream Closed</ns1:faultstring></ns1:XMLFault>

Looking at the code, it seems like the method used for producing text is using
try-with-resources
<https://github.com/hariso/tika/blob/2a0dc37a4427070360c7ebe147712d9c873a4e7b/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L408-L411>,
and the used input stream has already been closed. The method used for
producing XML doesn't do it
<https://github.com/hariso/tika/blob/2a0dc37a4427070360c7ebe147712d9c873a4e7b/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L476>.

In my use case, the parsed text is processed in an additional step, where
XML/HTML is not really desired, so I cannot use it as a workaround (at
least not now).

Any help or comments are appreciated!

Haris

Re: "Stream closed" error when extracting text using Tika Server

Posted by Haris Osmanagic <ha...@gmail.com>.
Yep, a filter did block me. It's called "Ignore mistyped email addresses".
:) Friday... In any case, now it works. :)



On Fri, Jun 2, 2017 at 5:39 PM Allison, Timothy B. <ta...@mitre.org>
wrote:

> Hmmm…
>
>
>
> Any spam filters getting in the way, maybe?

RE: "Stream closed" error when extracting text using Tika Server

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Hmmm…

Any spam filters getting in the way, maybe?

From: Haris Osmanagic [mailto:haris.osmanagic@gmail.com]
Sent: Friday, June 2, 2017 11:24 AM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: Re: "Stream closed" error when extracting text using Tika Server

@Tim

I'm not sure. I've been here <https://issues.apache.org/jira/secure/Signup!default.jspa> and registration completes, but no email. I also tried resetting my password, but no email again. :S

Re: "Stream closed" error when extracting text using Tika Server

Posted by Haris Osmanagic <ha...@gmail.com>.
@Tim

I'm not sure. I've been here
<https://issues.apache.org/jira/secure/Signup!default.jspa> and registration
completes, but no email. I also tried resetting my password, but no email
again. :S

On Fri, Jun 2, 2017 at 5:04 PM Allison, Timothy B. <ta...@mitre.org>
wrote:

> You already have!  ☺
>
>
>
> >I am not able to sign up for Apache's JIRA
>
>
>
> What went wrong?  That’s the best way to let us know that you’ve actually
> found a problem, which you did, unit test and all!

RE: "Stream closed" error when extracting text using Tika Server

Posted by "Allison, Timothy B." <ta...@mitre.org>.
You already have!  ☺

>I am not able to sign up for Apache's JIRA

What went wrong?  That’s the best way to let us know that you’ve actually found a problem, which you did, unit test and all!

From: Haris Osmanagic [mailto:haris.osmanagic@gmail.com]
Sent: Friday, June 2, 2017 10:56 AM
To: user@tika.apache.org; lfcnassif@gmail.com
Subject: Re: "Stream closed" error when extracting text using Tika Server

Thanks everyone for feedback!

I am not able to sign up for Apache's JIRA, so I couldn't open the ticket myself, sorry for that. Am I able to help somehow this way?


Re: "Stream closed" error when extracting text using Tika Server

Posted by Haris Osmanagic <ha...@gmail.com>.
Thanks everyone for feedback!

I am not able to sign up for Apache's JIRA, so I couldn't open the ticket
myself, sorry for that. Am I able to help somehow this way?

On Fri, Jun 2, 2017 at 3:18 PM Allison, Timothy B. <ta...@mitre.org>
wrote:

> I opened TIKA-2384 for this.  Let’s move discussion there.

RE: "Stream closed" error when extracting text using Tika Server

Posted by "Allison, Timothy B." <ta...@mitre.org>.
I opened TIKA-2384 for this.  Let’s move discussion there.

From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com]
Sent: Friday, June 2, 2017 9:00 AM
To: user@tika.apache.org
Subject: RE: "Stream closed" error when extracting text using Tika Server

I think resources should be closed where they are opened, as in the parser.parse() API contract, no?

Luis


RE: "Stream closed" error when extracting text using Tika Server

Posted by Luís Filipe Nassif <lf...@gmail.com>.
I think resources should be closed where they are opened, as in the
parser.parse() API contract, no?

Luis
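
To make the "closed where they are opened" idea concrete, here is a minimal
sketch using only the JDK. It is not Tika code: TrackingStream, countBytes and
demo are hypothetical names made up for this illustration. The assumption
(as far as I understand the parser.parse() contract mentioned above) is that
the consuming method reads the stream but leaves closing to the caller, so
the caller's try-with-resources is the only place where close() happens.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class CallerClosesSketch {

    /** Hypothetical stream that records whether close() has been called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Stand-in for a parse method: consumes the stream but does not close it. */
    static int countBytes(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    /** The caller opens the stream, so the caller closes it, exactly once. */
    static String demo() {
        TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3});
        StringBuilder report = new StringBuilder();
        try (InputStream in = stream) {
            report.append("count=").append(countBytes(in));
            report.append(" closedAfterParse=").append(stream.closed);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        report.append(" closedAfterTry=").append(stream.closed);
        return report.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this split of responsibilities, wrapping the call in try-with-resources
is safe because nothing else closes the stream.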

On Jun 2, 2017 at 8:27 AM, "Allison, Timothy B." <ta...@mitre.org>
wrote:

> Haris is correct.
>
> The static "parse()" closes the InputStream, so we shouldn't wrap the call
> to parse in an auto-closing try:
>
> try(InputStream is = xyz) {
>         TikaResource.parse(...)
> }
>
> Once I remove the autoclosing try, the test passes.

RE: "Stream closed" error when extracting text using Tika Server

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Haris is correct.

The static "parse()" closes the InputStream, so we shouldn't wrap the call to parse in an auto-closing try:

try(InputStream is = xyz) {
	TikaResource.parse(...)
}

Once I remove the autoclosing try, the test passes.
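
For anyone who wants to see the effect in isolation, below is a minimal,
self-contained sketch of one way such a double close can surface as an error.
It does not use any Tika or CXF classes. StrictStream, parseAndClose,
withAutoClose and withoutAutoClose are hypothetical names, and StrictStream is
deliberately written to throw on a second close() (most JDK streams simply
ignore a repeated close), so treat this as an illustration of the pattern
rather than a reproduction of what TikaResource actually does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class DoubleCloseSketch {

    /** Hypothetical stream that rejects a second close(). */
    static final class StrictStream extends ByteArrayInputStream {
        int closeCalls = 0;

        StrictStream(String text) {
            super(text.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            closeCalls++;
            if (closeCalls > 1) {
                throw new IOException("Stream Closed");
            }
        }
    }

    /** Stand-in for a parse method that closes the stream it is handed. */
    static String parseAndClose(InputStream in) throws IOException {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            in.close(); // first close
        }
    }

    /** Anti-pattern: try-with-resources closes the stream a second time. */
    static String withAutoClose(StrictStream stream) {
        try (InputStream in = stream) {
            return parseAndClose(in);
        } catch (IOException e) {
            // The exception from the implicit second close() lands here.
            return "error: " + e.getMessage();
        }
    }

    /** Without the auto-closing try, the stream is closed only once. */
    static String withoutAutoClose(StrictStream stream) {
        try {
            return parseAndClose(stream);
        } catch (IOException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(withAutoClose(new StrictStream("hello")));
        System.out.println(withoutAutoClose(new StrictStream("hello")));
    }
}
```

In the first variant the text is read successfully, but the result is replaced
by the error from the implicit close; in the second the text comes back as
expected.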


-----Original Message-----
From: Sergey Beryozkin [mailto:sberyozkin@gmail.com] 
Sent: Friday, June 2, 2017 7:20 AM
To: user@tika.apache.org
Subject: Re: "Stream closed" error when extracting text using Tika Server

Hi Tim, sorry, I'm not sure now what I was planning to fix :-), I've looked at the source again and it is not a case of an InputStream being returned directly from the method...
try/catch will most likely work better, though maybe it would hide some issue to do with some of the parsers closing the stream early somewhere...

Thanks, Sergey

Re: "Stream closed" error when extracting text using Tika Server

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim, sorry, I'm not sure now what I was planning to fix :-), I've
looked at the source again and it is not a case of an InputStream being
returned directly from the method...
try/catch will most likely work better, though maybe it would hide some
issue to do with some of the parsers closing the stream early somewhere...

Thanks, Sergey
On 02/06/17 12:13, Allison, Timothy B. wrote:
> Thank you for sharing this with us.
> 
> Oddly, I’m able to reproduce this with our 2pic.docx test file, but not 
> with our “test_recursive_embedded.docx”.
> 
> Please open a ticket on our JIRA.


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

RE: "Stream closed" error when extracting text using Tika Server

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you for sharing this with us.


Oddly, I’m able to reproduce this with our 2pic.docx test file, but not with our “test_recursive_embedded.docx”.



Please open a ticket on our JIRA.


From: Haris Osmanagic [mailto:haris.osmanagic@gmail.com]
Sent: Friday, June 2, 2017 6:28 AM
To: user@tika.apache.org
Subject: "Stream closed" error when extracting text using Tika Server

Hi everyone!

I am using Tika Server, and I have faced a weird thing when extracting text and requiring a plain text response. Tests can be found here: https://github.com/hariso/tika/commit/2a0dc37a4427070360c7ebe147712d9c873a4e7b

Version used: 1.15
File used: Any I tried (MS Word, DOCX, PDF)
Method used: Multipart upload, using Accept: text/plain

Expected result: extracted text
Actual result: extracted text PLUS an error saying

<ns1:XMLFault xmlns:ns1="http://cxf.apache.org/bindings/xformat"><ns1:faultstring xmlns:ns1="http://cxf.apache.org/bindings/xformat">java.io.IOException: Stream Closed</ns1:faultstring></ns1:XMLFault>

Looking at the code, it seems like the method used for producing text is using try-with-resources <https://github.com/hariso/tika/blob/2a0dc37a4427070360c7ebe147712d9c873a4e7b/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L408-L411>, and the used input stream has already been closed. The method used for producing XML doesn't do it <https://github.com/hariso/tika/blob/2a0dc37a4427070360c7ebe147712d9c873a4e7b/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L476>.

In my use case, the parsed text is processed in an additional step, where using XML/HTML is not really desired, hence I cannot use it as a workaround (at least not now).

Any help or comments are appreciated!

Haris



Re: "Stream closed" error when extracting text using Tika Server

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

By default, the CXF JAX-RS MessageBodyWriter which deals with InputStream
closes it as soon as the copy is complete. This can be disabled, but it
would indeed be simpler to avoid using try-with-resources. I can fix it...

FYI, re your test code, you can do response.getEntity(String.class)

Cheers, Sergey
On 02/06/17 11:27, Haris Osmanagic wrote:
> Hi everyone!
> 
> I am using Tika Server, and I have faced a weird thing when extracting 
> text and requiring a plain text response. Tests can be found here: 
> https://github.com/hariso/tika/commit/2a0dc37a4427070360c7ebe147712d9c873a4e7b
> 
> *Version used*: 1.15
> *File used*: Any I tried (MS Word, DOCX, PDF)
> *Method used*: Multipart upload, using Accept: text/plain
> 
> *Expected result*: extracted text
> *Actual result*: extracted text PLUS an error saying
> 
> <ns1:XMLFault 
> xmlns:ns1="http://cxf.apache.org/bindings/xformat"><ns1:faultstring 
> xmlns:ns1="http://cxf.apache.org/bindings/xformat">java.io.IOException: 
> Stream Closed</ns1:faultstring></ns1:XMLFault>
> 
> Looking at the code, it seems like the method used for producing text is 
> using try-with-resources 
> <https://github.com/hariso/tika/blob/2a0dc37a4427070360c7ebe147712d9c873a4e7b/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L408-L411>, 
> and the used input stream has already been closed. The method used for 
> producing XML doesn't do it 
> <https://github.com/hariso/tika/blob/2a0dc37a4427070360c7ebe147712d9c873a4e7b/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L476>.
> 
> In my use case, the parsed text is processed in an additional step, where
> using XML/HTML is not really desired, hence I cannot use it as a
> workaround (at least not now).
> 
> Any help or comments are appreciated!
> 
> Haris
> 
>